PREFACE


In the summer of 1979, I spent a few weeks in Tokyo under the
sponsorship of the Office of Naval Research (ONR). This monograph is
based on conferences with researchers in Japan, in the areas of
psychometrics, educational measurement, and educational technologies,
and on research materials and technical literature collected during
this trip.

I thank Dr. Rudolph J. Marcus, Scientific Director, Miss Eunice Mohri,
and other ONR/Tokyo staff members for providing me with office space
and services, taking me to JICST, and helping me in many other ways.

I  was  invited  to  one  of  the  bimonthly  meetings  of  the  Educational 
Technology  Group  of  the  Institute  of  Electronics  and  Communication 
Engineers  in  Japan,  which  was  held  at  the  Central  Research  Laboratories 
of  Nippon  Electric  Co.,  Ltd.,  on  23  July,  1979,  and  had  an  opportunity 
to  talk  with  the  researchers  who  came  to  the  meeting  from  many  different 
districts  of  Japan.  The  author  is  thankful  to  Dr.  Takahiro  Sato,  the 
representative  of  the  Group,  and  other  members  for  their  kind  cooperation 
in  collecting  research  materials  and  literature. 

It  was  also  a  pleasure  to  have  several  conferences  with  Dr. 
Sukeyori  Shiba,  Professor  of  Education  at  the  University  of  Tokyo  and  an 
old  friend  of  mine,  during  my  stay  in  Tokyo,  and  to  get  to  know  a  large 
scale  research  project  on  the  measurement  of  vocabulary  conducted  by  him 
and  his  students.  The  author  is  thankful  to  him  and  his  students  for 
making  copies  of  their  research  materials  and  sending  them  to  Knoxville, 
Tennessee,  after  I  returned. 


Because  of  the  shortage  of  time,  the  author  could  not  see  all 
the  people  she  had  wanted  to;  among  them  are  Professor  Takeuchi  of  the 
University  of  Tokyo  and  Dr.  Akaike  of  the  Institute  of  Mathematical 
Statistics,  who  happened  to  be  out  of  town  during  her  stay  in  Tokyo. 

The  stimulation  of  these  conversations,  and  of  the  research 
materials  and  literature  obtained  in  Tokyo,  started  new  trains  of 
thought  in  the  author's  mind.  Some  of  these  concern  the  multiple- 
choice  item,  which  is  the  subject  of  this  monograph.  Others  require 
yet  more  work  and  further  communication  with  Japanese  colleagues.  In 
particular, the author feels it is worth trying to reanalyze the
vocabulary test data collected by Shiba and others, using theory and
methods which the author has developed and is going to develop.

The  author  is  thankful  to  the  Office  of  Naval  Research  for  this 
opportunity  of  visiting  Tokyo,  and  hopes  that  the  present  report  will 
contribute to the development of mental test theory and science in
general.

I  Introduction

Psychometricians will hardly doubt that good mental test items are
informative items, that is, items which contribute a great deal to
the estimation of the examinee's ability and, therefore, uncover
the individual differences among the examinees accurately. In the
history of mental test theory,
the  multiple-choice  item  arrived  later  than  the  free-response 
item,  out  of  the  necessity  of  administering  group  tests  and  of 
scoring  their  results  speedily  and  objectively,  in  the  sense  that 
there  is  no  need  for  our  subjective  judgment  and  evaluation  in 
scoring.  Today,  an  enormous  number  of  multiple-choice  tests 
are  administered  to  youngsters,  and  their  results  have  been  used 
in  many  important  decision-making  situations,  such  as  guidance, 
selection,  classification,  and  so  on.  To  construct  good  multiple- 
choice  test  items  and  to  develop  good  mental  test  theory  which 
deals  with  the  multiple-choice  item  are,  therefore,  most  important. 

Since  the  multiple-choice  item  was  introduced  as  a  substitute 
for  the  free-response  item,  it  has  been  treated  by  mental  test 
theorists  as  something  which  is  useful  from  the  practical  point  of 
view,  but  not  quite  as  good  as  the  free-response  item.  The  three- 
parameter  logistic,  or  normal  ogive,  model,  which  is  widely  used 
by  psychologists  and  educational  psychologists  for  the  multiple- 
choice  item  today,  is  nothing  but  a  "blurred"  image  of  the  logistic, 
or  normal  ogive,  model  for  the  free-response  item.  In  other  words, 
there  is  nothing  meaningful  which  is  added  to  the  original  logistic, 
or  normal  ogive,  model,  but  there  are  additional  noises  caused 
by random guessing in the three-parameter logistic, or normal
ogive  model. 

We  must  stop  and  think,  however,  if  the  three-parameter 
logistic,  or  normal  ogive,  model  really  fits  psychological  reality, 
and if the multiple-choice test item cannot be more than a "blurred"
image  of  the  free-response  item.  The  author's  answer  to  the  first 
question  is  negative,  to  the  second  positive.  It  is  clear  in  the 
author's  mind  that  we  need  a  better  model  than  the  three-parameter 
logistic,  or  normal  ogive,  model  for  the  multiple-choice  item,  and 
that  the  multiple-choice  item  can  provide  us  with  a  larger  amount 
of  information  which  results  in  a  more  accurate  ability  estimation, 
if  we  make  use  of  the  information  given  by  its  distractors,  which 
the  free-response  item  does  not  have. 

It was interesting to discover that, while very few researchers
in the United States have ever questioned the appropriateness of the
three-parameter logistic, or normal ogive, model for the
multiple-choice item, or have tried to validate it for their research
data, the author's perception is shared by some Japanese researchers.

Some  of  these  are  members  of  a  nation-wide  research  group  called 
the  Educational  Technology  Group  of  the  Institute  of  Electronics 
and  Communication  Engineers  in  Japan.  Most  of  the  members  of  the 
group  are  engineers  in  computer  science,  and  some  of  them  are 
educational  psychologists.  Tatsuoka  has  reported  their  names 
and  research  activities  (Tatsuoka,  1979),  which  are  represented 
by  such  topics  as  the  S-P  table  (Student-Problem  table), 
the number of hypothetical, equivalent alternatives*, interpretive
structural  modeling  based  on  graph  theory,  and  so  forth.  Some  of 
their  papers,  which  the  author  has  had  the  opportunity  of  reading, 
are  listed  in  Appendix  III.  Their  standpoint  concerning  the  multiple- 
choice  item  is  based  on  information  theory  (e.g.,  Goldman,  1953), 
considering  that  an  item  is  a  good  one  if  its  expected  uncertainty 
in  the  selection  of  an  alternative  is  high.  As  the  measure  of 
the  quality  of  an  item,  the  number  of  hypothetical,  equivalent 
alternatives  (Sato,  1977)  is  used,  which  will  be  introduced  in 
Chapter  2.  One  impressive  feature  of  the  activities  of  this  group 
of  researchers  is  that  they  do  not  use  computers  mechanically, 
as  many  other  researchers  do,  but  they  give  teachers  the  feedback 
information  about  the  test  items  constantly,  and  then  they  obtain 
the  teachers'  feedback  based  on  the  content  analysis  of  the  items 
in question, and so on. Another group is Shiba and his students of
the School of Education, University of Tokyo. They have spent the
past several years developing vocabulary tests, which are aimed at
measuring the vocabulary of subjects over a wide range of ages,
collecting data, constructing an integrated vocabulary scale
(Shiba, 1978), and then constructing a tailored test out of these
vocabulary test items, using the information given by the distractors,
as well as the correct answers, for branching examinees (Shiba,
Noguchi and Haebara, 1978). The theory and method used for analyzing
their data are basically the same as those adopted in the research
in which the author was involved (Indow and Samejima, 1962, 1966).
The outline of the work accomplished by Shiba and others will be
given in Chapter 6.


*Tatsuoka translated the original word as the effective (or equivalent)
number of options, but the author uses the present translation.


With  the  research  conducted  by  these  people  as  incentives, 
the  author  has  integrated  her  own  ideas  about  mathematical  models 
and the multiple-choice item. This has resulted in a proposed method
of validating, or invalidating, the three-parameter logistic, or normal
ogive, model and the knowledge or random guessing principle, and
eventually in a proposed new family of models for the multiple-choice
item, in which the information given by the distractors is fully
utilized.

II  Sato's Number of Hypothetical, Equivalent Alternatives

Let $g$ $(= 1, 2, \ldots, n)$ be a multiple-choice test item. In the
present paper, however, this symbol $g$ is omitted whenever it is
clear that we deal with only one item. Let $i$ $(= 1, 2, \ldots, m)$ be
an alternative, or an option, of the multiple-choice item $g$, and
$p_i$ be the probability with which the examinee selects the alternative
$i$. The entropy $H$ is defined as the expectation of $-\log_2 p_i$
such that

(2.1)  $H = -\sum_{i=1}^{m} p_i \log_2 p_i$ ,

for the set of $m$ alternatives of item $g$. It is obvious from
(2.1) that the entropy $H$ is non-negative, and, if one of the $m$
alternatives is the sure event with unity as its probability, then
$H = 0$. Sato's number of hypothetical, equivalent alternatives,
$k$, is defined by

(2.2)  $k = 2^{H}$ ,

and is used as an index of the effectiveness of the set of $m$
alternatives for item $g$ in the context of information theory.
Since the entropy $H$ indicates the expected uncertainty of the
set of $m$ events, or alternatives, the set of alternatives is more
informative for a greater value of $k$.

When the probability $p_i$ is replaced by the frequency ratio,
$P_i$, we can write for the estimate of the entropy such that

(2.3)  $\hat{H} = -\sum_{i=1}^{m} P_i \log_2 P_i$ ,

and for the estimate of $k$ we have

(2.4)  $\hat{k} = 2^{\hat{H}}$ .
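
The estimates (2.3) and (2.4) can be sketched in a few lines of Python. This is only an illustration: the function name `sato_k` and the response counts are hypothetical, and the convention that an alternative with zero frequency contributes nothing to the entropy is an assumption not stated in the text.

```python
import math

def sato_k(counts):
    """Return (H_hat, k_hat) for one item, following (2.3) and (2.4).

    counts[i] is the number of examinees who selected alternative i.
    Alternatives nobody selected are skipped (assumed 0 * log 0 = 0).
    """
    n = sum(counts)
    props = [c / n for c in counts if c > 0]       # frequency ratios P_i
    h_hat = -sum(p * math.log2(p) for p in props)  # (2.3)
    k_hat = 2.0 ** h_hat                           # (2.4)
    return h_hat, k_hat

# Hypothetical response counts for a five-choice item.
counts = [50, 20, 15, 10, 5]
h_hat, k_hat = sato_k(counts)
print(f"H_hat = {h_hat:.3f}, k_hat = {k_hat:.3f}")
```

When all examinees choose the same alternative the function returns an entropy of zero and an estimate of one equivalent alternative, in agreement with the remark on the sure event.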


We notice that we can obtain the number of hypothetical,
equivalent alternatives $k$ without using the entropy, for we have

(2.5)  $k = 2^{H} = 2^{-\sum_{i=1}^{m} p_i \log_2 p_i} = \prod_{i=1}^{m} p_i^{-p_i} = \left[ \prod_{i=1}^{m} p_i^{p_i} \right]^{-1}$ .



The quantity in the brackets of the last expression of (2.5) is
a kind of weighted geometric mean of $p_i$. Equation (2.5) also
implies that we can use any base for $\log p_i$, instead of 2.
For convenience, hereafter we shall use $e$ as the base of $\log p_i$,
and use $H^*$ instead of $H$ such that

(2.6)  $H^* = -\sum_{i=1}^{m} p_i \log p_i \geq 0$ ,

which equals zero when one of the alternatives is the sure event, and

(2.7)  $k = e^{H^*} \geq 1$ ,

and simply write $\log p_i$ instead of $\log_e p_i$.

To find out the value of $p_i$ which maximizes $H^*$, and hence
$k$, we define $Q$ such that

(2.8)  $Q = -\sum_{i=1}^{m} p_i \log p_i + \lambda \left[ \sum_{i=1}^{m} p_i - 1 \right]$ ,

where $\lambda$ is Lagrange's multiplier. Thus the partial derivative of
$Q$ with respect to $p_i$ is given by

(2.9)  $\frac{\partial Q}{\partial p_i} = -[\log p_i + (1/p_i) p_i] + \lambda = -\log p_i + (\lambda - 1)$ .
Setting this derivative equal to zero, we obtain

(2.10)  $\log p_i = \lambda - 1$ ,

which is a constant regardless of the value of $i$. Since we have

(2.11)  $\sum_{i=1}^{m} p_i = 1$ ,

we obtain

(2.12)  $p_i = 1/m$ .

Thus it is clear that $H^*$, and hence $k$, is maximal when all the
$m$ alternatives are equally probable, and we can write

(2.13)  $\max(H^*) = \log m$

and

(2.14)  $\max(k) = m$ .
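
Both the base-free product form (2.5) and the maximum (2.14) are easy to check numerically. The following is a minimal sketch, not part of the original derivation; the function names and the two probability vectors are hypothetical.

```python
import math

def k_from_entropy(p, base=2.0):
    """k computed as base ** H, with the entropy H taken in the given base."""
    h = -sum(pi * math.log(pi, base) for pi in p if pi > 0)
    return base ** h

def k_from_product(p):
    """k computed without logarithms, as in the last expression of (2.5)."""
    return 1.0 / math.prod(pi ** pi for pi in p if pi > 0)

p = [0.5, 0.2, 0.15, 0.1, 0.05]   # hypothetical selection probabilities
uniform = [0.2] * 5               # all five alternatives equally probable

for base in (2.0, math.e, 10.0):
    print(base, k_from_entropy(p, base))   # same k for every base
print(k_from_product(p))                   # agrees with the values above
print(k_from_product(uniform))             # reaches the maximum, m = 5
```

Any non-uniform vector gives a value strictly below the number of alternatives, which is the content of (2.12) through (2.14).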

Since in the present situation the $m$ events are alternatives,
the values of $H^*$ and $k$ are affected by the difficulty level of
item $g$. Let $R$ be the correct answer to item $g$, which is given
as one of its alternatives, and $p_R$ be the probability with which
the examinee selects the correct answer $R$. Figure 2-1 presents
the relationship between the probability $p_R$ and the number of
hypothetical, equivalent alternatives $k$. In this figure, the
area marked by slanted lines indicates the set of $k$'s which are
less than $\max(k \mid p_R)$ and greater than $\max[1/p_R, \min(k \mid p_R)]$, and
are considered to be reasonable values of $k$ by Sato and others.

Figure 2-1. Relationship between the Probability with Which the Correct
Answer R Is Selected and the Number of Hypothetical,
Equivalent Alternatives, for Five-Choice Items.

In practice, Figure 2-1 is used by replacing the probability
$p_R$ by the proportion correct, $P_R$, and the number of hypothetical,
equivalent alternatives, $k$, by its estimate $\hat{k}$. It is well-known
that the frequency ratio is both the least squares solution and the
maximum likelihood estimator of the corresponding probability.
It is interesting to note that, in addition, it is the estimator
which minimizes the chi-square statistic. Let us define $Q$ such
that

(2.15)  $Q = \sum_{i=1}^{m} [(N P_i - N p_i)^2 / (N p_i)] + \lambda \left[ \sum_{i=1}^{m} p_i - 1 \right]$ ,

where $N$ is the number of examinees and $\lambda$ is Lagrange's multiplier.
Then we have

(2.16)  $\frac{\partial Q}{\partial p_i} = N[(p_i^2 - P_i^2)/p_i^2] + \lambda = 0$ ,

and

(2.17)  $p_i = [1 + (\lambda/N)]^{-1/2} P_i$ .

Since

(2.18)  $1 = \sum_{i=1}^{m} p_i = [1 + (\lambda/N)]^{-1/2} \sum_{i=1}^{m} P_i = [1 + (\lambda/N)]^{-1/2}$ ,

we obtain

(2.19)  $\lambda = 0$ ,

and from this and (2.17) we can write

(2.20)  $\hat{p}_i = P_i$ .

The translation, "the number of hypothetical, equivalent
alternatives," indicates the number of alternatives in the
hypothetical situation where the entropy H is provided by
alternatives which are equivalent in the uncertainty of occurrence.
Although it is not the direct translation of the original word,
it is used for k in the present paper, for it seems to the author
to describe the original best.

III  Information Given by Distractors in the Multiple-Choice Item
and  Random  Guessing 

Sato's  number  of  hypothetical,  equivalent  alternatives  has 
been  used  mainly  by  the  members  of  the  Technical  Group  of  Educational 
Technologists  in  Japan  (cf.  Tatsuoka,  1979)  for  the  purpose  of 
analyzing  the  effectiveness  of  alternatives  in  relation  with 
a  relatively  small  group  of  examinees.  The  basic  idea  behind 
this  index  is  that  the  expected  uncertainty  of  the  m  events,  or 
alternatives,  be  large,  and,  therefore,  the  number  of  hypothetical, 
equivalent  alternatives  be  close  to  m  .  We  notice  that: 

(1)  this  concept  is  strongly  population-oriented,  unlike  those 
concepts  in  latent  trait  theory, 

(2)  it  is  assumed  that  each  examinee  tries  to  answer  the  item 
seriously,  without  depending  upon  random  guessing, 

and, 

(3)  relative  to  the  population  of  examinees,  the  existence  of 
too  attractive  a  distractor  is  not  desirable,  since  it 
tends  to  reduce  the  value  of  k  . 

Thus  as  long  as  this  index  is  used  for  the  analysis  of  test  items 
which  are  given  with  careful  guidance  and  supervision  to  samples 
of  examinees  from  a  well-defined  population,  and  the  findings  of 
the  analysis  are  not  generalized  across  populations,  it  will  serve 
its  purpose. 

If  we  generalize  this  concept  and  the  resultant  findings 
beyond  these  restrictions,  however,  we  may  be  led  to  completely 
false conclusions. To give an extreme example, suppose that none
of our examinees took the test seriously, and each selected one of the
alternatives at random, for each item of the test. In such a case,
regardless of the difficulty level of the item, the number of
hypothetical, equivalent alternatives, k, will be very close to
m for every item. In spite of this superficial success, we have
obtained no information about the individual examinees' ability
levels as the result of testing.
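
The extreme case described above can be illustrated by a small simulation. This is a sketch under stated assumptions: the sample size, the seed, and the function names are hypothetical, and every simulated examinee picks one of the m alternatives uniformly at random.

```python
import math
import random

def k_hat(counts):
    """Estimated number of hypothetical, equivalent alternatives."""
    n = sum(counts)
    return math.exp(-sum((c / n) * math.log(c / n) for c in counts if c > 0))

def simulate_blind_guessing(n_examinees, m, seed=0):
    """Response counts when every examinee selects an alternative at random."""
    rng = random.Random(seed)
    counts = [0] * m
    for _ in range(n_examinees):
        counts[rng.randrange(m)] += 1
    return counts

counts = simulate_blind_guessing(10_000, 5)
k = k_hat(counts)
print(counts, round(k, 4))   # k is close to m = 5 although nothing is measured
```

The index comes out near its maximum even though the responses carry no information about ability, which is exactly the superficial success the text warns against.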

It is also noted that, if the examinee's behavior follows
the knowledge or random guessing principle, i.e., he will answer
correctly if he knows the answer, or guess randomly otherwise, the
value of k tends to be large. In this case, too, our success
in obtaining a large k is only superficial and meaningless.

In  addition  to  the  above  facts,  it  is  obvious  that  the  value 
of  the  number  of  hypothetical,  equivalent  alternatives  varies  for 
different  populations,  i.e.,  the  same  item  may  have  a  value  of 
k  which  is  very  close  to  m  for  one  population  of  examinees, 
and  may  have  a  very  low  value  for  another  population.  This  may 
be  due  to  the  difference  in  the  mean  ability  levels  of  the  two 
populations,  or  to  the  different  forms  of  two  ability  distributions, 
or  both.  Thus  while  the  index  may  be  useful  for  a  fixed  population 
of  examinees  and  if  we  discuss  "how  good  an  item  is"  in  relation  to 
that  specific  population,  it  cannot  be  considered  as  a  parameter 
of  the  item  per  se.  This  limitation  of  the  usefulness  of  k 
is  of  the  same  kind  that  is  applicable  for  the  reliability  coefficient 
of  the  test,  i.e.,  in  spite  of  most  psychologists'  belief  that 
the reliability coefficient is one of the most important and solid
properties  of  the  test  itself,  it  heavily  depends  upon  the  specific 
population  of  examinees  for  which  the  test  is  administered,  and, 
therefore,  is  a  dead  concept  since  the  population-free  test  information 
function  is  sufficient  to  serve  the  purpose  (Samejima,  1977a). 

As  a  whole,  there  is  no  single  answer  to  the  question:  "Are 
items  which  have  high  values  of  the  number  of  hypothetical,  equivalent 
alternatives  good  items?"  even  if  we  control  the  testing  situation 
with  respect  to  the  purpose  of  testing,  such  as  guidance,  selection, 
etc.  This  is  true  even  if  we  restrict  the  populations  of  examinees, 
and  it  is  mainly  because  of  the  noise  induced  by  random  guessing. 

That  is  to  say,  in  a  general  situation  of  testing,  it  is  hard  for 
us  to  determine  whether  we  have  accomplished  the  work  by  obtaining 
a  high  value  of  k  .  In  fact,  the  largest  possible  value  of  k 
may  imply  no  accomplishment  at  all,  as  we  have  seen  in  one  of  the 
preceding  paragraphs  of  the  present  chapter! 

In  spite  of  the  above  limitations,  however,  the  introduction 
of  the  number  of  hypothetical,  equivalent  alternatives  and  its  use 
by  Sato  and  other  researchers  of  the  Technical  Group  of  Educational 
Technologists  should  be  well  credited,  for  their  vision  is 
oriented  toward  the  full  use  of  the  information  given  by  all  the 
alternatives  of  the  multiple-choice  item.  It  seems  that  they 
are  quite  successful  in  using  the  index  in  the  small  group  situation, 
such  as  school  classes  where  instructions  are  well  conveyed  and 
random  guessing  is  extremely  discouraged.  This  orientation  is  in 
quite  a  contrast  to  the  attitude  of  many  researchers  who  are  accustomed 

IV  Three-Parameter Models in Latent Trait Theory and the Role
of  Item  Distractors 

Let $\theta$ be ability, or latent trait, that we intend to measure
with our test. The three-parameter logistic model, or normal ogive
model, is based upon the knowledge or random guessing principle, i.e.,
the examinee either knows the answer or guesses randomly. Let $\Psi_g(\theta)$
be the item characteristic function of item $g$, which is the
conditional probability with which the examinee answers item $g$
correctly, given $\theta$, in the free-response situation. This is given
by

(4.1)  $\Psi_g(\theta) = (2\pi)^{-1/2} \int_{-\infty}^{a_g(\theta - b_g)} e^{-u^2/2} \, du$

in the normal ogive model, and

(4.2)  $\Psi_g(\theta) = [1 + \exp\{-D a_g (\theta - b_g)\}]^{-1}$

in the logistic model, where $a_g$ is the item discrimination parameter
and $b_g$ is the item difficulty parameter (Lord and Novick, 1968,
Chapter 16), and $D$ in (4.2) is the scaling factor which assumes
1.7 (Birnbaum, 1968) when the logistic model is used as a substitute
for the normal ogive model.

The item characteristic function, $P_g(\theta)$, for the multiple-choice
item in the three-parameter normal ogive, or logistic, model
is defined by

(4.3)  $P_g(\theta) = \Psi_g(\theta) + [1 - \Psi_g(\theta)] c_g = c_g + [1 - c_g] \Psi_g(\theta)$ ,

where $\Psi_g(\theta)$ is given by (4.1) or (4.2) and $c_g$ is a constant which
is called the guessing parameter, and equals $1/m$, or $1/m_g$.
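
As a minimal sketch of (4.1) through (4.3), the functions below evaluate the free-response item characteristic function in both forms and the resulting three-parameter function with the guessing parameter fixed at 1/m. The function names and the parameter values in the usage lines are hypothetical; the normal ogive is computed from the standard normal distribution function through `math.erf`.

```python
import math

def psi_logistic(theta, a, b, D=1.7):
    """Logistic item characteristic function, as in (4.2)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def psi_normal_ogive(theta, a, b):
    """Normal ogive item characteristic function, as in (4.1)."""
    return 0.5 * (1.0 + math.erf(a * (theta - b) / math.sqrt(2.0)))

def p_three_param(theta, a, b, m, psi=psi_logistic):
    """Three-parameter function (4.3) with guessing parameter c = 1/m."""
    c = 1.0 / m
    return c + (1.0 - c) * psi(theta, a, b)

# Hypothetical five-choice item with a = 1.0 and b = 0.0.
for theta in (-3.0, 0.0, 3.0):
    print(theta,
          round(p_three_param(theta, 1.0, 0.0, 5), 4),
          round(p_three_param(theta, 1.0, 0.0, 5, psi=psi_normal_ogive), 4))
```

For very low ability the function approaches the guessing level 1/m rather than zero, which is the only way the alternatives enter these models.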

It  should  be  noted  that,  following  these  models,  there  is 
no  information  given  by  the  alternatives  other  than  the  correct 
answer,  for  all  the  responses  to  the  wrong  answers  are  the  result 
of  random  guessing.  Should  one  of  these  models  be  valid  for  the 
item  in  question,  the  multiple-choice  item  would  be  nothing  but 
a  poor  image  of  the  binary,  free-response  item,  which  is  contaminated 
by  the  noise  caused  by  random  guessing. 

Let $j$ be an individual examinee, and $u_j$ be the binary
item score for the multiple-choice item $g$. The conditional
expectation and variance of the binary item score $u$, given $\theta$,
can be written as

(4.4)  $E(u \mid \theta) = P_g(\theta) = c + (1 - c) \Psi_g(\theta) = (1/m)[1 + (m-1) \Psi_g(\theta)]$ ,

where $c$ is the simplification of $c_g$, and

(4.5)  $\mathrm{Var.}(u \mid \theta) = [(m-1)/m^2][1 - \Psi_g(\theta)][1 + (m-1) \Psi_g(\theta)]$ .

Let $u_{ij}$ be the binary alternative score for the alternative $i$
obtained by the individual $j$, for the multiple-choice item $g$.
Thus we can write

(4.6)  $u_{Rj} = u_j$ .

The conditional expectation and variance of the binary alternative
score $u_i$ $(i \neq R)$, given $\theta$, are given by

(4.7)  $E(u_i \mid \theta) = c[1 - \Psi_g(\theta)] = (1/m)[1 - \Psi_g(\theta)]$

and

(4.8)  $\mathrm{Var.}(u_i \mid \theta) = (1/m^2)[1 - \Psi_g(\theta)][(m-1) + \Psi_g(\theta)]$ .
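
Since the item score and each alternative score are binary, the conditional variances in (4.5) and (4.8) should equal the product of the conditional mean and its complement. The short sketch below checks this; the function name and the trial values of the free-response probability are hypothetical.

```python
def conditional_moments(psi, m):
    """Conditional moments under the knowledge or random guessing principle.

    psi is the value of the free-response item characteristic function at
    a given ability level, and m is the number of alternatives.
    """
    c = 1.0 / m
    e_u = c + (1.0 - c) * psi                                    # (4.4)
    var_u = (m - 1) / m**2 * (1.0 - psi) * (1.0 + (m - 1) * psi)  # (4.5)
    e_ui = c * (1.0 - psi)                                       # (4.7)
    var_ui = (1.0 / m**2) * (1.0 - psi) * ((m - 1) + psi)        # (4.8)
    return e_u, var_u, e_ui, var_ui

for psi in (0.0, 0.3, 0.9):   # hypothetical values of the free-response ICF
    e_u, var_u, e_ui, var_ui = conditional_moments(psi, 5)
    print(psi, round(e_u, 4), round(var_u, 4), round(e_ui, 4), round(var_ui, 4))
```

The conditional means of the correct answer and of the m - 1 wrong answers also sum to one, as they must for a single selected alternative.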

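Let $\lambda$ be either $u$ or $u_i$, or any other discrete random variable,
and $p(\lambda)$ and $p(\lambda \mid \theta)$ denote the marginal and conditional probability
functions of $\lambda$, respectively. Then the relationships among the
conditional and unconditional expectations and variances are given
by

(4.9)  $E(\lambda) = \sum \lambda \, p(\lambda) = \sum \lambda \int_{-\infty}^{\infty} p(\lambda \mid \theta) f(\theta) \, d\theta = \int_{-\infty}^{\infty} \sum \lambda \, p(\lambda \mid \theta) f(\theta) \, d\theta$
       $= \int_{-\infty}^{\infty} E(\lambda \mid \theta) f(\theta) \, d\theta = E[E(\lambda \mid \theta)]$

and

(4.10)  $\mathrm{Var.}(\lambda) = \sum [\lambda - E(\lambda)]^2 p(\lambda) = \sum [\lambda - E(\lambda)]^2 \int_{-\infty}^{\infty} p(\lambda \mid \theta) f(\theta) \, d\theta$
        $= \int_{-\infty}^{\infty} \sum [\lambda - E(\lambda \mid \theta)]^2 p(\lambda \mid \theta) f(\theta) \, d\theta$
        $+ \int_{-\infty}^{\infty} [E(\lambda \mid \theta) - E(\lambda)]^2 \sum p(\lambda \mid \theta) f(\theta) \, d\theta$
        $= E[\mathrm{Var.}(\lambda \mid \theta)] + E[E(\lambda \mid \theta) - E(\lambda)]^2$ .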

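In particular, we can write

(4.11)  $E(u) = E[E(u \mid \theta)] = \int_{-\infty}^{\infty} P_g(\theta) f(\theta) \, d\theta = p_R$

and

(4.12)  $\mathrm{Var.}(u) = E[\mathrm{Var.}(u \mid \theta)] + E[E(u \mid \theta) - E(u)]^2$
        $= \int_{-\infty}^{\infty} P_g(\theta)[1 - P_g(\theta)] f(\theta) \, d\theta + \int_{-\infty}^{\infty} [P_g(\theta) - p_R]^2 f(\theta) \, d\theta$
        $= p_R - p_R^2 = p_R(1 - p_R)$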

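for the binary item score $u$, and, for the alternative score $u_i$,

(4.13)  $E(u_i) = E[E(u_i \mid \theta)] = (1/m) \int_{-\infty}^{\infty} [1 - \Psi_g(\theta)] f(\theta) \, d\theta$
        $= [1/(m-1)] \int_{-\infty}^{\infty} [1 - P_g(\theta)] f(\theta) \, d\theta$
        $= [1/(m-1)](1 - p_R)$

and

(4.14)  $\mathrm{Var.}(u_i) = E[\mathrm{Var.}(u_i \mid \theta)] + E[E(u_i \mid \theta) - E(u_i)]^2$
        $= (1/m^2) \int_{-\infty}^{\infty} [1 - \Psi_g(\theta)][(m-1) + \Psi_g(\theta)] f(\theta) \, d\theta$
        $+ (1/m^2) \int_{-\infty}^{\infty} [\{1 - \Psi_g(\theta)\} - m p_i]^2 f(\theta) \, d\theta$
        $= (1/m) \int_{-\infty}^{\infty} [1 - \Psi_g(\theta)] f(\theta) \, d\theta - 2 p_i (1/m) \int_{-\infty}^{\infty} [1 - \Psi_g(\theta)] f(\theta) \, d\theta + p_i^2$
        $= p_i(1 - p_i)$ .
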


We  notice  that  E(u)  given  in  (4.11)  is  the  item  difficulty  parameter 
in  classical  test  theory,  which  depends  upon  the  specific  population 
of  examinees  as  well  as  the  test  item. 

It should be noted that both the expectation and the variance
of $u_i$ for $i \neq R$, which are given by (4.13) and (4.14), respectively,
are equal for all the wrong answers, and are determined, solely, by
$p_R$ and the number of the alternatives, $m$. This is the logical
consequence of the fact that the responses to those wrong answers
are completely the result of random guessing, and provide us with
no information about the examinees' ability levels.
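
The consequence stated above can be checked numerically. The sketch below replaces the ability density by a discrete distribution on a grid, which is an assumption made only for illustration; the grid, the two sets of weights, the item parameters, and the function names are all hypothetical.

```python
import math

def psi(theta, a=1.0, b=0.0, D=1.7):
    """Logistic free-response item characteristic function, as in (4.2)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def marginal_moments(thetas, weights, m):
    """Return (p_R, p_i, Var(u_i)) for one wrong answer i under (4.3)."""
    total = sum(weights)
    f = [w / total for w in weights]                 # discrete stand-in for f(theta)
    c = 1.0 / m
    p_g = [c + (1.0 - c) * psi(t) for t in thetas]   # (4.3)
    q = [c * (1.0 - psi(t)) for t in thetas]         # (4.7)
    p_r = sum(fi * p for fi, p in zip(f, p_g))       # (4.11)
    p_i = sum(fi * qi for fi, qi in zip(f, q))       # (4.13)
    # (4.14): expected conditional variance plus variance of the conditional mean
    var_ui = (sum(fi * qi * (1.0 - qi) for fi, qi in zip(f, q))
              + sum(fi * (qi - p_i) ** 2 for fi, qi in zip(f, q)))
    return p_r, p_i, var_ui

thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
low_group = [4, 3, 2, 1, 1]     # hypothetical weights, mostly low ability
high_group = [1, 1, 2, 3, 4]    # hypothetical weights, mostly high ability
m = 5
for weights in (low_group, high_group):
    p_r, p_i, var_ui = marginal_moments(thetas, weights, m)
    print(round(p_r, 4), round(p_i, 4), round((1.0 - p_r) / (m - 1), 4))
```

In both groups the probability of each wrong answer equals (1 - p_R)/(m - 1) and its variance equals p_i(1 - p_i), so the wrong answers reveal nothing beyond what p_R and m already determine.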

We  must  remember,  however,  that  most  of  the  conscientious 
test  constructors  try  to  avoid  the  contamination  of  the  quality  of 
items, by finding incorrect, but plausible, answers and including
them as distractors in the set of alternatives. This indicates
that the responses to these alternatives are not the result of random
guessing,  and  may  contain  useful  information  about  the  examinee's 
ability  level.  The  adoption  of  one  of  the  three-parameter  models 
for  such  multiple-choice  items  is  not  justifiable,  since  in  so  doing 
the  researchers  distort  psychological  reality  and  will  produce 
nothing  but  meaningless  artifacts  as  the  result  of  their  research. 

It is strange to the author that many researchers have ignored
the contradiction described in the preceding paragraphs,
and have for years applied the three-parameter models to their data,
which, obviously, are based on tests containing many distractors.
As long as they continue repeating this mistake, their conscientiousness
as researchers has to be questioned.

V  Index k* for Invalidating Three-Parameter Models

It  has  been  pointed  out  in  Chapter  3  that  Sato’s  number  of 
hypothetical,  equivalent  alternatives  takes  on  a  high  value,  if 
every  examinee  in  the  group  has  selected  one  of  the  m  alternatives 
at  random.  This  fact  implies  that,  although  the  index  was  introduced 
for  quite  an  opposite  purpose,  it  may  also  be  useful  in  detecting 
the  examinee's  random  guessing  behavior  in  the  multiple-choice 
item. 

To  materialize  the  above,  we  need  the  following  consideration. 
When  the  examinee  follows  the  knowledge  or  random  guessing  principle 
and  the  item  characteristic  function  assumes  the  three-parameter 
logistic,  or  normal  ogive,  model,  the  index  k  is  solely  affected 
by  the  probability  with  which  the  examinee  knows  the  answer,  as  is 
obvious  from  Figure  2-1  and  (4.3)  and  (4.11).  This  fact  provides 
some  inconvenience,  however,  for  the  probability  of  knowing  the 
answer  heavily  depends  upon  the  specific  population  of  examinees,  in 
addition  to  the  item  characteristic  function  of  the  item  in  the 
free-response  situation.  It  will  be  more  convenient,  therefore, 
if  we  can  modify  Sato's  index  k  in  such  a  way  that  it  is  unaffected 
by  the  ability  distribution  of  a  specific  population  of  examinees, 
and  can  be  considered  as  a  pure  property  of  the  item.  With  this 
aim  in  mind,  we  shall  introduce  a  new  index  in  this  chapter. 

Let  A  be  the  event  that  the  examinee  does  not  know  the 
answer  to  item  g  ,  and  consider  the  probability  space  which 
consists  of  such  a  subpopulation  of  examinees.  The  conditional 
probability,  p(i|A)  ,  with  which  the  examinee  selects  the  alternative 
$i$ of item $g$ in this conditional probability space is given by

(5.1)  $p(i \mid A) = p_i \left[ \sum_{s \neq R} p_s + p_R^* \right]^{-1}$  for $i \neq R$ ,
       $p(i \mid A) = p_R^* \left[ \sum_{s \neq R} p_s + p_R^* \right]^{-1}$  for $i = R$ ,

where $p_R^*$ denotes the probability with which the examinee guesses
correctly for item $g$. The new index, $k^*$, is defined in terms
of these conditional probabilities, in such a way that

(5.2)  $k^* = \exp\left[ -\sum_{i=1}^{m} p(i \mid A) \log p(i \mid A) \right] = \left[ \prod_{i=1}^{m} p(i \mid A)^{p(i \mid A)} \right]^{-1}$ .

It is obvious that $p(i \mid A)$ for $i \neq R$ is proportional to $p_i$, for
every examinee in the population who has selected one of the wrong
answers does not know the answer, and, consequently, he is also
in the subpopulation $A$. On the other hand, examinees who have
selected the correct answer $R$ are not necessarily in the
subpopulation $A$, so we can write

(5.3)  $p_R^* < p_R$ .

Note that, if the examinee's behavior follows the knowledge or random
guessing principle and the item characteristic function of the
multiple-choice item $g$ is of one of the three-parameter models,
$p_R^*$ equals $p_i$ for $i \neq R$, and, as the result, all the $m$ $p(i \mid A)$'s
are equal and $k^* = m$.

In practice, we need to use some estimates for the $p(i \mid A)$'s
to obtain the estimate of $k^*$. Since we have the frequency ratio,
$P_i$, for the estimate of $p_i$ for $i \neq R$, all we need to do is to
find out an appropriate estimate of $p_R^*$. Let $P_R^*$ denote such
an estimate of $p_R^*$, and $P_i^*$ be such that

(5.4)  $P_i^* = P_i$  for $i \neq R$ ,
       $P_i^* = P_R^*$  for $i = R$ .

Then we can write for the estimate of $p(i \mid A)$ such that

(5.5)  $\hat{p}(i \mid A) = P_i^* \left[ \sum_{s=1}^{m} P_s^* \right]^{-1}$ .



We  are  to  take  the  strategy  of  finding  P*  which  makes  k*  maximal. 

K 

Define  H*  such  that 

m 

(5.6)  H*  -  log  k*  -  -  Z  p(i (A) ’log  p(i|A) 

i-1 

m  -  m  m  m 

-  -[  Z  P*]'  [  Z  P**log  P*  -  (  Z  P*)-log  {  Z  p*}]  . 
s-1  3  i-1  1  1  i-1  1  s-1  9 

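Then the partial derivative of $\hat{H}^*$ with respect to $P_R^*$ can
be written as

(5.7)  $\frac{\partial \hat{H}^*}{\partial P_R^*} = \left[\sum_{s=1}^{m} P_s^*\right]^{-2} \left[\sum_{i=1}^{m} P_i^* \log P_i^* - \left(\sum_{s=1}^{m} P_s^*\right) \log P_R^*\right]$ ,

and, setting this equal to zero, we obtain

(5.8)  $\log P_R^* = \left[\sum_{s \neq R} P_s\right]^{-1} \sum_{i \neq R} P_i \log P_i$

and then

(5.9)  $P_R^* = \prod_{i \neq R} P_i^{\,P_i \left[\sum_{s \neq R} P_s\right]^{-1}}$ .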


Thus we can use (5.9) in (5.4), and, therefore, obtain $\hat{p}(i|A)$
through (5.5). The estimate of the new index, $\hat{k}^*$, is given by

(5.10)  $\hat{k}^* = \exp\left[-\sum_{i=1}^{m} \hat{p}(i|A) \log \hat{p}(i|A)\right] = \left[\prod_{i=1}^{m} \hat{p}(i|A)^{\hat{p}(i|A)}\right]^{-1}$ .
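As a minimal sketch of how (5.9), (5.4), (5.5), (5.6) and (5.10) chain together, the following Python function starts from the observed frequency ratios and the position of the correct answer. The function name `k_star_estimate` and its argument names are illustrative and are not taken from the original work. The sketch assumes that every wrong answer was selected at least once, since the logarithm of a zero frequency is undefined.

```python
import math

def k_star_estimate(freq, correct):
    """Return (P*_R, list of p-hat(i|A), H-hat*, k-hat*).

    freq    : frequency ratios P_i of the m alternatives
    correct : index of the correct answer R within `freq`
    Assumes every wrong answer has a positive frequency ratio.
    """
    wrong = [p for i, p in enumerate(freq) if i != correct]
    # (5.8)-(5.9): P*_R is a weighted geometric mean of the wrong-answer ratios
    log_p_star_r = sum(p * math.log(p) for p in wrong) / sum(wrong)
    p_star_r = math.exp(log_p_star_r)
    # (5.4): replace the ratio of the correct answer by P*_R
    p_star = [p_star_r if i == correct else p for i, p in enumerate(freq)]
    # (5.5): normalize to obtain the estimates of p(i|A)
    total = sum(p_star)
    p_hat = [p / total for p in p_star]
    # (5.6) and (5.10): entropy and number of equivalent alternatives
    h_star = -sum(p * math.log(p) for p in p_hat)
    return p_star_r, p_hat, h_star, math.exp(h_star)

# Error-free ratios under the knowledge or random guessing principle, m = 5
p_star_r, p_hat, h_star, k_star = k_star_estimate([0.6, 0.1, 0.1, 0.1, 0.1], correct=0)
print(p_star_r, h_star, k_star)
```

With the error-free ratios used in the call, the modified ratio for the correct answer collapses to the common distractor ratio, so the index comes out at m = 5 up to floating-point rounding.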

A necessary, though not sufficient, condition for one of the
three-parameter models to be valid is that $\hat{k}^*$ should be equal
to $m$ within sampling fluctuations, regardless of the population of
examinees from which our sample happened to be selected. If this is
not the case, we must say that the three-parameter model does not
fit our item, i.e., the model is invalidated.

Although the invalidation of the three-parameter logistic, or normal
ogive, model is easy, its validation is more difficult. We recall that
Sato's number of hypothetical, equivalent alternatives is used as a
measure of the desirability of the item for a specific population of
examinees. If all the distractors are equally probable for a specific
population, then the index $k^*$ will also equal $m$, in spite of the
fact that the two cases are completely different in nature. This
problem can be solved by administering the same test to a different
group of examinees, which has a different ability distribution from
that of the first group. If the large value of $k^*$ is due to the
knowledge or random guessing principle, then it will also be large for
the second group of examinees because of its population-free nature.
On the other hand, if the large value of $k^*$ results from the optimal
quality of the item for the first group of examinees, then it will not
be as large for the second group, unless the operating characteristics
of all the distractors are identical.

It should be emphasized that $k^*$ takes on a large value even if the
knowledge or random guessing principle does not work behind the
examinee's behavior, as long as the item is "suitable" for the group of
examinees to which the test has been administered, in the same sense
that a high value of Sato's number of hypothetical, equivalent
alternatives is meant to indicate. This fact means that, when we can
use only one set of data for validating, or invalidating, the knowledge
or random guessing principle and the three-parameter logistic, or
normal ogive, model, we must use at least one more necessary condition
for the principle to be valid. One such necessary condition is that the
sample means of ability $\theta$, or of its estimate, of the subgroups
of examinees who have selected the wrong answers should be equal,
within the range of sampling fluctuations. Thus, if either the value of
$\hat{k}^*$ is substantially less than $m$, or the sample means of
ability $\theta$ of such subgroups of examinees are not close to each
other, then we shall be able to say that the knowledge or random
guessing principle and the three-parameter model are invalidated. On
the other hand, if both of the necessary conditions are satisfied with
our data, we can say there is no reason to reject the principle and the
model.

For the purpose of illustration, a set of simulated data was
calibrated, using the Monte Carlo method. In this set of data, five
hypothetical multiple-choice test items were assumed, each having five
alternatives, A, B, C, D and E, with A always as the correct answer.
Each item is assumed to follow the three-parameter normal ogive model,
which is given by (4.1) and (4.3), with the parameter values shown in
Table 5-1.


TABLE 5-1

Item Discrimination Parameter $a_g$ and Item Difficulty Parameter $b_g$
of Each of the Five Hypothetical, Binary Items Following the
Three-Parameter Normal Ogive Model, with $c_g = 0.2$.

Item     $a_g$     $b_g$
  1      1.00      0.00
  2      1.50      0.00
  3      2.00      0.00
  4      2.50      0.00
  5      3.50      0.00


A group of five hundred hypothetical examinees was assumed, whose
ability levels are placed at one hundred equally spaced points on the
ability continuum, which start with -2.475 and end with 2.475, in such
a way that subjects 1 through 5 are placed at $\theta = -2.475$,
subjects 6 through 10 are at $\theta = -2.425$, and so on. For each of
the five hypothetical multiple-choice items, the response of each of
the five hundred hypothetical examinees was calibrated according to the
specified item characteristic function and the knowledge or random
guessing principle. These calibrated responses are presented as Table
A-1 in Appendix 1.
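The data-generating scheme just described can be sketched in Python as follows. This is only an illustration under stated assumptions: it takes the knowledge or random guessing principle to mean that an examinee knows the answer with probability $\Phi(a_g(\theta - b_g))$ and otherwise picks one of the five alternatives blindly, which reproduces a three-parameter normal ogive curve with $c_g = 0.2$. The function names, the seed, and the use of Python's `random` module are choices made for this sketch and do not describe the original Monte Carlo program.

```python
import math
import random

def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def simulate_item(a, b, thetas, alternatives="ABCDE", correct="A", rng=random):
    """One simulated response per examinee under knowledge or random guessing."""
    responses = []
    for theta in thetas:
        knows = rng.random() < normal_cdf(a * (theta - b))
        if knows:
            responses.append(correct)
        else:
            # blind guess among all m alternatives, so c = 1/m
            responses.append(rng.choice(alternatives))
    return responses

# 100 equally spaced ability levels from -2.475 to 2.475, five examinees at each
thetas = [-2.475 + 0.05 * (j // 5) for j in range(500)]
rng = random.Random(1979)  # arbitrary seed, used only for reproducibility
params = {1: (1.0, 0.0), 2: (1.5, 0.0), 3: (2.0, 0.0), 4: (2.5, 0.0), 5: (3.5, 0.0)}
for item, (a, b) in params.items():
    resp = simulate_item(a, b, thetas, rng=rng)
    ratios = {alt: resp.count(alt) / len(resp) for alt in "ABCDE"}
    print(item, ratios)
```

Because the random numbers differ from those of the original study, the printed frequency ratios will not reproduce the published tables; they should only scatter around 0.6 for A and 0.1 for each wrong answer.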

Table 5-2 presents the frequency ratio, $P_i$, of each of the five
alternatives, for each of the five hypothetical multiple-choice items.
We can see that sampling fluctuations are fairly large for item 4, and
to a lesser degree for item 2, since the corresponding probability,
$p_i$, is 0.6 for the alternative A and 0.1 for each of the
alternatives B, C, D and E. In the same table, also presented are the
values of $P_R^*$, which were obtained through (5.9). Using these
values in (5.6), (5.9) and (5.10), the estimates of the entropy
$\hat{H}^*$ and the index $\hat{k}^*$ were obtained, and are presented
in Table 5-3. Since the maximal possible value of $\hat{H}^*$ is
approximately 1.60944 (= log m) and that of $\hat{k}^*$ is 5 (= m), we
can say that these results are sufficiently close to their respective
maximal values, i.e., they exemplify the satisfaction of one of the
necessary conditions for validating the three-parameter normal ogive
model and the knowledge or random guessing principle by our simulated
data. The fact that these results are less satisfactory for item 4 and,
to a lesser degree, for item 2 must be due to the sampling
fluctuations which were observed in Table 5-2.


TABLE 5-2

Frequency Ratio of the Subjects, $P_i$, Who Selected Each of the Five
Alternatives, and the Modified Frequency Ratio $P_R^*$ for the Correct
Answer A, for Each of the Five Hypothetical Items.


TABLE 5-3

Entropy, $\hat{H}^*$, and the Number of Hypothetical, Equivalent
Alternatives, $\hat{k}^*$, for Each of the Five Hypothetical Items
Following the Three-Parameter Normal Ogive Model.

Item     $\hat{H}^*$     $\hat{k}^*$
  1      1.60714         4.98853
  2      1.60501         4.97789
  3      1.60744         4.99000
  4      1.59224         4.91475
  5      1.60829         4.99424

As another necessary condition for validating the three-parameter
normal ogive model and the knowledge or random guessing principle, the
mean of $\theta$ for each of the five subgroups of examinees, who
selected different alternatives, was computed, for each of the five
multiple-choice items. Table 5-4 presents these means of $\theta$. In
the same table, also presented is the expectation of $\theta$ for each
of the five subgroups, using the uniform ability distribution for the
interval, [-2.5, 2.5], for each item, following the three-parameter
normal ogive model and the knowledge or random guessing principle.
Since all the responses to one of the four wrong answers of each item
are nothing but the result of random guessing, these alternatives are
equivalent, and have the same mean value of $\theta$. We can see that,
for each item, the mean of $\theta$ for the correct answer and that of
each incorrect answer are substantially different, and they are close
enough to the respective theoretical means.
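The theoretical means referred to above can be approximated numerically. The sketch below assumes the item characteristic function $c_g + (1 - c_g)\Phi(a_g(\theta - b_g))$ with $c_g = 0.2$ and a uniform ability distribution on [-2.5, 2.5], and integrates by the midpoint rule; the function name `theoretical_means` and the number of integration steps are choices made for this illustration only.

```python
import math

def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def theoretical_means(a, b, c=0.2, lo=-2.5, hi=2.5, steps=20000):
    """Return (E[theta | correct answer], E[theta | a given wrong answer]).

    Ability is taken to be uniform on [lo, hi]; integration uses the midpoint rule.
    """
    h = (hi - lo) / steps
    num_c = den_c = num_w = den_w = 0.0
    for k in range(steps):
        theta = lo + (k + 0.5) * h
        p = c + (1.0 - c) * normal_cdf(a * (theta - b))  # P(correct | theta)
        num_c += theta * p
        den_c += p
        # each wrong answer has probability (1 - p) / 4; the factor 1/4 cancels
        num_w += theta * (1.0 - p)
        den_w += 1.0 - p
    return num_c / den_c, num_w / den_w

for item, a in enumerate([1.0, 1.5, 2.0, 2.5, 3.5], start=1):
    mean_correct, mean_wrong = theoretical_means(a, 0.0)
    print(item, round(mean_correct, 3), round(mean_wrong, 3))
```

Under these assumptions the values for items 1 and 5 should fall close to the theoretical entries 0.703 and -1.054, and 0.822 and -1.234, listed in Table 5-4.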

In practice, there is no way to observe the examinee's $\theta$
itself. We can, however, use its maximum likelihood estimate,
$\hat{\theta}$, as the substitute in the above process, for example.
We must obtain a result similar to the one above to validate the
three-parameter models and the knowledge or random guessing principle.


TABLE 5-4

Sample Mean of $\theta$ for the Subgroup of Hypothetical Examinees Who
Selected Each of the Five Alternatives, and Its Corresponding
Theoretical Mean, for Each of the Five Multiple-Choice Items.

         A (Correct)             B, C, D, E (Incorrect)
Item   Theoretical  Sample     B        C        D        E      Theoretical
  1      0.703      0.619   -0.912   -1.017   -0.994   -0.905     -1.054
  2      0.774      0.752   -1.341   -1.084   -1.249   -1.161     -1.161
  3      0.800      0.811   -1.165   -1.233   -1.224   -1.237     -1.200
  4      0.812      0.809   -1.230   -1.119   -1.253   -1.369     -1.218
  5      0.822      0.809   -1.061   -1.193   -1.260   -1.282     -1.234


We notice that a result similar to the one in our example can be
obtained if, incidentally, all the distractors require "on the
average" approximately the same level of ability for the examinee to
be attracted to them, for our group of examinees. This fact indicates
that it is desirable to add more necessary conditions to examine, such
as the approximate equality of the second moment of $\theta$, or
$\hat{\theta}$, that of the third moment, etc., for the subgroups of
examinees who have selected the wrong answers. Since these subgroups
of examinees are "equivalent" in ability distribution if the knowledge
or random guessing principle and the three-parameter model are valid,
these higher moments should be equal within sampling fluctuations,
whereas it is highly unlikely that all the subgroups of examinees who
have been attracted to separate distractors are equivalent in ability
distribution. We must avoid, however, using moments of too high a
degree, for their sampling fluctuations tend to be enormously great.
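A small Python sketch of this additional check is given below. It computes, for the subgroup attracted to each wrong answer, the mean and the second and third moments of ability. The text does not say whether raw or central moments are meant; central moments are used here as an assumption, and the function name and the tiny data set are made up purely to show the call.

```python
def subgroup_moments(thetas, responses, wrong_alternatives):
    """Mean and second and third central moments of ability per wrong answer."""
    result = {}
    for alt in wrong_alternatives:
        group = [t for t, r in zip(thetas, responses) if r == alt]
        n = len(group)
        if n == 0:
            continue  # nobody selected this alternative
        mean = sum(group) / n
        m2 = sum((t - mean) ** 2 for t in group) / n
        m3 = sum((t - mean) ** 3 for t in group) / n
        result[alt] = (mean, m2, m3)
    return result

# Made-up abilities and responses, for illustration only
thetas = [-1.0, 0.0, 1.0, -2.0, 2.0, 0.5]
responses = ["B", "B", "B", "C", "C", "A"]
print(subgroup_moments(thetas, responses, "BCDE"))
```

In an application, the abilities would be replaced by maximum likelihood estimates, and the moments of the different subgroups would be compared with one another within sampling fluctuations.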


VI  Shiba's Research on the Measurement of Vocabulary

In this chapter, we shall introduce research on the measurement of
vocabulary, which was conducted by Shiba and others. The author found
it interesting, especially in the following aspects.

(1)  The  vocabulary  tests  they  used  are  very  well  constructed, 
choosing  each  alternative  carefully. 

(2)  Subjects  were  selected  from  many  different  age  groups. 

(3)  Unlike many researchers in the United States, they have
tried to make full use of the distractors.

The battery of tests used for the construction of the vocabulary scale
consists of eleven tests, A1, A2, A3, A4, A5, A6, J1, J2, S1, S2 and U.
Each test contains thirty to fifty-eight multiple-choice items, each
having a set of five alternatives. These tests differ in difficulty,
and each of them is designed for a different age group, ranging from
six years of age to the ages of college students. There are subsets of
items included in two tests which are adjacent to each other in
difficulty. For example, items 37 through 56 of Test J1 are also items
1 through 20 of Test J2. The number of examinees used for the
vocabulary scale construction varies between 412 sixth graders of
elementary schools for Test A5 and 924 second graders of senior high
schools for Test S1 (cf. Shiba, 1978).

The model adopted for the item characteristic function of each
vocabulary item is the logistic model, such that

(6.1)  $P_g(\theta) = [1 + \exp\{-D a_g (\theta - b_g)\}]^{-1}$ ,

where $a_g$ and $b_g$ are the item discrimination and difficulty
parameters, respectively, and $D = 1.7$. Note that Shiba did not use
the three-parameter logistic model, which is characterized by (4.2)
and (4.3). This is based on his belief, formed through his many
experiences in test construction and research, that three-parameter
models are not applicable to well-developed multiple-choice items.
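The role of the scaling constant D = 1.7 in (6.1) can be illustrated numerically: with this constant the logistic curve stays close to the normal ogive with the same item parameters. The following Python sketch, whose function names are chosen for this illustration, evaluates the largest gap between the two curves on a grid for a single item with discrimination 1 and difficulty 0.

```python
import math

D = 1.7  # scaling constant of (6.1)

def logistic_icc(theta, a, b):
    """Logistic item characteristic function of (6.1)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def normal_ogive_icc(theta, a, b):
    """Normal ogive item characteristic function with the same parameters."""
    return 0.5 * (1.0 + math.erf(a * (theta - b) / math.sqrt(2.0)))

# Grid from -4 to 4 in steps of 0.01
grid = [k / 100.0 for k in range(-400, 401)]
max_gap = max(abs(logistic_icc(t, 1.0, 0.0) - normal_ogive_icc(t, 1.0, 0.0))
              for t in grid)
print(round(max_gap, 4))
```

The printed gap should stay below 0.01, which is why the two models can be interchanged for most practical purposes once D is included.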

Each of the eleven tests was administered to a group of subjects who
belong to a single school year, except for college students.
Hereafter, for convenience, we shall use EL for elementary schools, JH
for junior high schools, SH for senior high schools, and CS for
colleges, and add the school year after each symbol. For instance, by
SH2 we mean a group of subjects who are in the second year of senior
high schools. The correspondence of the subject groups and the tests
administered is summarized as follows:

     A1 for EL1 (650), A2 for EL2 (650), A3 for EL3 (546),
     A4 for EL4 (617), A5 for EL5 (599), A6 for EL6 (412),
     J1 for JH1 (614), J2 for JH2 (758), S1 for SH1 (924),
     S2 for SH2 (759) and U for CS (740),

where the numbers in parentheses indicate the respective numbers of
examinees. Note that JH3 and SH3 are not included in the data which
are the basis of the vocabulary scale construction.

The  main  steps  for  analyzing  these  data  are  the  following. 

[A]  For each of the eleven groups of examinees, the ability
distribution is assumed to be the standard normal distribution.


[B]  Assuming the normal ogive model, such that

(6.2)  $P_g(\theta) = (2\pi)^{-1/2} \int_{-\infty}^{a_g(\theta - b_g)} e^{-u^2/2} \, du$ ,

where $a_g$ and $b_g$ are the item discrimination and difficulty
parameters, respectively, and the local independence of the item
variables (Lord and Novick, 1968, Chapter 16), and also that the
regression of each item variable on ability $\theta$ is linear, the
tetrachoric correlation coefficient is computed for each and every
pair of items.

[C]  The principal factor solution of factor analysis is applied to
the correlation matrix thus obtained, using the largest absolute value
of the correlation coefficient in each row, or column, as the
communality. This step is also the process of validating the
unidimensionality of ability $\theta$. Figure 6-1 illustrates the
resulting set of eigenvalues for Test J1, which was administered to
614 first year junior high school students. It turned out that the
first eigenvalue is much larger than all the other eigenvalues, and
thus the unidimensionality was confirmed. Hereafter, this first
principal factor is treated as $\theta$.


[D]  From the result of factor analysis, the item parameters are
obtained. Let $\rho_g$ be the factor loading (e.g., Lawley and
Maxwell, 1971) of the first principal factor, or $\theta$, for item g.
The item discrimination parameter, $a_g$, is obtained by

(6.3)  $a_g = \rho_g (1 - \rho_g^2)^{-1/2}$ .

Let $\Phi(u)$ denote the standard normal distribution function, such
that

(6.4)  $\Phi(u) = (2\pi)^{-1/2} \int_{-\infty}^{u} e^{-t^2/2} \, dt$ .

The item difficulty parameter, $b_g$, is given by

(6.5)  $b_g = \Phi^{-1}(1 - p_g) \, \rho_g^{-1}$ ,

where $p_g$ is the probability with which the examinee answers item g
correctly. In practice, this is replaced by the frequency ratio,
$P_g$, to provide us with the estimate of $b_g$.

[E]  The eleven ability scales thus constructed are considered to be
on the same continuum, and they are integrated into a single scale.
This equating is made through the ten subsets of items, each of which
is shared by two adjacent tests. Let $a_g$ and $b_g$ be the item
parameters estimated from the result of the first test, and $a_g^*$
and $b_g^*$ be those from the result of the second test. Denoting the
two ability scales by $\theta$ and $\theta^*$, respectively, we can
write

(6.6)  $a_g(\theta - b_g) = a_g^*(\theta^* - b_g^*)$ ,

since the item characteristic functions, which follow the normal ogive
model, of the same item g on the two ability scales must assume the
same value for the corresponding values of $\theta$ and $\theta^*$.
Thus the functional relationship between $\theta$ and $\theta^*$ is
given by

(6.7)  $\theta^* = (a_g / a_g^*)\,\theta + [b_g^* - (a_g / a_g^*)\, b_g]$ ,

which is linear, and the two coefficients are obtained from these four
parameters. In practice, we obtain as many sets of coefficients as the
number of common items, and we need to use some type of "average" of
these coefficients for the scale transformation. Figure 6-2 presents
the ability distributions of the eleven subject groups after such
transformations were made and the mean and the standard deviation of
the distribution of J1 were taken as the origin and the unit for the
new, integrated ability dimension.

[F]  The item characteristic function of each item on the new,
integrated scale $\theta$ is approximated by the logistic function,
which is given by (6.1).

[G]  The maximum likelihood estimate, $\hat{\theta}_j$, of each
examinee's ability is obtained through the equation

(6.8)  $\sum_{g=1}^{n} a_g P_g(\hat{\theta}_j) = \sum_{g=1}^{n} a_g u_{gj}$

(cf. Birnbaum, 1968), where $u_{gj}$ is the binary item score of
individual j for item g.

[H]  The test information function of each test is obtained by

(6.9)  $I(\theta) = \sum_{g=1}^{n} I_g(\theta)$ ,

where $I_g(\theta)$ is the item information function of item g such
that

(6.10)  $I_g(\theta) = [P_g'(\theta)]^2 [P_g(\theta)\{1 - P_g(\theta)\}]^{-1}$ .

Figure 6-3 presents the test information functions thus obtained for
the eleven tests.

[I]  The theoretical frequency distribution of test score T for each
test and examinee group can be written as

(6.11)  $N \sum_{V \in T} \prod_{u_g \in V} P_g(\theta)^{u_g} [1 - P_g(\theta)]^{1 - u_g}$ ,

where V is a response pattern or a vector of n item scores, and T is
the test score given by

(6.12)  $T = \sum_{g=1}^{n} u_g$ .

This is used for the validation of the model and assumptions adopted
in the process of analysis. Figure 6-4 illustrates the goodness of fit
of this theoretical frequency distribution of test score to the actual
frequency distribution, for Test J1.

[J]  The sample mean of the maximum likelihood estimate $\hat{\theta}$
of the subgroup of examinees who selected each of the five
alternatives is calculated, for each item of each test.

[K]  A tailored test of vocabulary is constructed by selecting an
appropriate subset of items from these eleven tests, in such a way
that an individual is directed to a next item which is chosen on the
basis of the sample mean of $\hat{\theta}$ of the alternative he has
selected for the present item.

[Figure: Information Functions of the Eleven Tests on Vocabulary]

We have seen in the preceding paragraphs a brief sketch of Shiba and
others' work. It is unfortunate that the author cannot convey the fine
quality of the tests themselves to the reader, for they are vocabulary
tests and their translation from Japanese into English would certainly
destroy the nature of the tests. We can see that the research has been
conducted very conscientiously, however, including several processes
of validation, and has eventually produced a widely applicable
vocabulary scale and a tailored test. In the latter result, although
there is some room for improvement, the use of distractors for
"branching" subjects should be taken as a stimulus to the researchers
who are engaged in this area, for it has seldom been seriously
investigated by other researchers.
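Returning to the analysis steps listed earlier in this chapter, steps [D], [E], [G] and [H] can be sketched in Python as follows. This is a hedged illustration, not the program used by Shiba and others: all function names are invented here, a single common item is used for the equating coefficients of (6.7) instead of an "average" over common items, and (6.8) is solved by simple bisection under the assumption that the response pattern is neither all correct nor all wrong and that all discrimination parameters are positive.

```python
import math
from statistics import NormalDist

D = 1.7
STD_NORMAL = NormalDist()

def item_parameters(loading, prop_correct):
    """Step [D]: a_g by (6.3) and b_g by (6.5)."""
    a = loading / math.sqrt(1.0 - loading ** 2)
    b = STD_NORMAL.inv_cdf(1.0 - prop_correct) / loading
    return a, b

def equating_coefficients(a, b, a_star, b_star):
    """Step [E], (6.7): theta* = slope * theta + intercept, from one common item."""
    slope = a / a_star
    return slope, b_star - slope * b

def logistic_icc(theta, a, b):
    """Logistic item characteristic function of (6.1)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def mle_theta(items, scores, lo=-6.0, hi=6.0, tol=1e-8):
    """Step [G], (6.8): solve sum a_g P_g(theta) = sum a_g u_g by bisection.

    items  : list of (a_g, b_g) pairs with positive a_g
    scores : list of binary item scores, assumed to be a mixed pattern
    """
    target = sum(a * u for (a, _), u in zip(items, scores))
    def excess(theta):
        return sum(a * logistic_icc(theta, a, b) for a, b in items) - target
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0.0:  # left side of (6.8) too large: theta is too high
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def information_function(theta, items):
    """Step [H], (6.9)-(6.10); for the logistic model P' = D a P (1 - P)."""
    total = 0.0
    for a, b in items:
        p = logistic_icc(theta, a, b)
        total += (D * a) ** 2 * p * (1.0 - p)
    return total

print(item_parameters(0.6, 0.5))
print(equating_coefficients(1.0, 0.5, 2.0, 1.0))
print(mle_theta([(1.0, -1.0), (1.0, 1.0)], [1, 0]))
print(information_function(0.0, [(1.0, 0.0)]))
```

The numbers passed in the four calls are arbitrary values chosen to exercise each function and are not taken from the vocabulary data.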

The research conducted by Shiba and others includes more interesting
data than were used in the vocabulary scale construction. Table 6-1
presents a part of them, in which the frequency distribution of the
alternative selection and the mean of the maximum likelihood estimate
of ability for each alternative are shown for nineteen items included
in both Tests J1 and J2, and administered to four different subject
groups, JH1, JH2(a), JH2(b) and JH3. In the same table, also presented
is the discrepancy between the mean of $\hat{\theta}$ for the correct
answer and the lowest mean $\hat{\theta}$ for one of the four wrong
answers, under the heading, "largest discrepancy." The correct answers
are always identified as the ones which have the highest means of
$\hat{\theta}$, except for the one for item 3 administered to JH2(b),
which is the second highest


TABLE 6-1 (Continued): Test J1, Junior High School Grade 2

                                        Alternative                        Largest
Item  Indices                    1        2        3        4        5   Total  Discrepancy

37    Mean $\hat{\theta}$    0.886   -0.215   -0.249   -0.312    0.028
      FRQ                      269       39       39       37       71    455     1.198

38    Mean $\hat{\theta}$
      FRQ

39    Mean $\hat{\theta}$    0.384    0.186    0.083   -0.068    1.015
      FRQ                       55       97       82       50      166    450     1.083

40    Mean $\hat{\theta}$    0.521   -0.133    0.109    0.802   -0.286
      FRQ                       61       95       45      243       14    458     1.088

41    Mean $\hat{\theta}$   -0.553   -0.440   -0.173   -0.019    0.665
      FRQ                       27       13       19       47      355    461     1.218

42    Mean $\hat{\theta}$    0.810   -0.426    0.348   -0.089   -0.201
      FRQ                      257       14       68       67       51    457     1.236

43    Mean $\hat{\theta}$   -0.162    0.791   -0.578    0.142   -0.321
      FRQ                       10      312       53       46       37    458     1.369

44    Mean $\hat{\theta}$    0.298   -0.145   -0.228    0.664    0.237
      FRQ                       65       54       15      291       31    456     0.892

45    Mean $\hat{\theta}$   -0.124    0.139   -0.290    0.823   -0.469
      FRQ                       30       23       79      299       28    459     1.292

46    Mean $\hat{\theta}$    0.849   -0.751   -0.263   -0.260   -0.072
      FRQ                      308       25       29       90        7    459     1.600

47    Mean $\hat{\theta}$   -0.136    0.764   -0.119   -0.194   -0.001
      FRQ                       43      302       54       30       30    459     0.958

48    Mean $\hat{\theta}$    0.483    0.262   -0.889   -0.086    0.871
      FRQ                       56       85       38       45      231    455     1.760

49    Mean $\hat{\theta}$    0.050   -0.351    0.183   -0.419    0.756
      FRQ                       96       16       19       35      294    460     1.175

50    Mean $\hat{\theta}$    0.798    0.153   -0.634    0.151   -0.099
      FRQ                      269       19       20       84       63    455     1.432

51    Mean $\hat{\theta}$    0.118   -0.260    0.312    0.150    0.909
      FRQ                       76       47       55       68      202    448     1.169

52    Mean $\hat{\theta}$    0.195    0.778    0.035    0.206    0.177
      FRQ                       60      239       71       21       58    449     0.743

53    Mean $\hat{\theta}$    0.376    0.193   -0.013    0.918    0.040
      FRQ                       94       34       26      180      125    459     0.931

54    Mean $\hat{\theta}$    0.817    0.256    0.282    0.221    0.051
      FRQ                      177       75       82      108        9    451     0.766

55    Mean $\hat{\theta}$   -0.043   -0.042   -0.052   -0.455    1.157
      FRQ                       20       45      174       18      201    458     1.612

56    Mean $\hat{\theta}$    0.256    0.236   -0.289    1.354    0.247
      FRQ                       70      100       80      128       77    455     1.643


TABLE 6-1 (Continued): Test J2, Junior High School Grade 3

Item  Indices                  Alternative                        Total  Largest Discrepancy

      Mean $\hat{\theta}$    0.161   -0.838   -0.787   -1.099
      FRQ                      436       30       25       19      573     1.260

VII  Use of Index k* When Distractors Are in Full Work

It is obvious in Table 6-1 of the preceding chapter that for these
vocabulary items the knowledge or random guessing principle does not
work behind the examinee's behavior, for the mean values of
$\hat{\theta}$ for the wrong answers are substantially different from
one another for most of the items. In cases like this, index $k^*$,
which was introduced in Chapter 5 as a modification of Sato's number
of hypothetical, equivalent alternatives and used as an index for
invalidating three-parameter models, can be used as a measure of
desirability of the item for the group of examinees in question, just
as Sato's index is meant to be used. An additional merit of index
$k^*$ when it is used for this purpose is that it can be used
directly, without depending upon the relationship with the probability
for the correct answer, $p_R$, which is illustrated by Figure 2-1.

Table 7-1 presents the estimated entropy $\hat{H}^*$ obtained by
(5.6), for each of the nineteen items and each of the four groups of
examinees, JH1, JH2(a), JH2(b) and JH3. The values of index
$\hat{k}^*$, which correspond to these $\hat{H}^*$'s in Table 7-1,
were obtained by (5.10) and are shown in Table 7-2.
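As a worked check of this computation, the sketch below applies (5.9), (5.4), (5.5), (5.6) and (5.10) to the FRQ rows of items 37 and 41 in the part of Table 6-1 for Test J1, grade 2. The function name `entropy_and_k` is chosen for this illustration, raw frequencies are used in place of frequency ratios (the common factor cancels in (5.5)), and the correct answer is assumed to be the alternative with the highest mean ability estimate.

```python
import math

def entropy_and_k(counts, correct):
    """H-hat* by (5.6) and k-hat* by (5.10), from alternative-selection frequencies."""
    wrong = [n for i, n in enumerate(counts) if i != correct]
    # (5.9): P*_R, here on the scale of counts rather than ratios
    p_star_r = math.exp(sum(n * math.log(n) for n in wrong) / sum(wrong))
    # (5.4): the correct-answer frequency is replaced by P*_R
    p_star = wrong + [p_star_r]
    total = sum(p_star)
    # (5.5)-(5.6): normalize and take the entropy
    h = -sum((p / total) * math.log(p / total) for p in p_star)
    return h, math.exp(h)

# FRQ rows of Table 6-1 (Test J1, grade 2) and the assumed correct alternative
table_6_1 = {
    37: ([269, 39, 39, 37, 71], 0),
    41: ([27, 13, 19, 47, 355], 4),
}
for item, (counts, correct) in table_6_1.items():
    h, k = entropy_and_k(counts, correct)
    print(item, round(h, 5), round(k, 5))
```

Under these assumptions the results should agree, up to rounding, with the JH2(a) entries for items 37(1) and 41(5) in Tables 7-1 and 7-2.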

We can see in these tables that thirteen out of the total of nineteen
items have higher values of $\hat{H}^*$, and hence of $\hat{k}^*$, for
JH2(a) than for JH2(b). Since the subjects in these two groups are of
the same school year, i.e., the second year of junior high school,
this tendency may be related to the fact that for JH2(a) these
nineteen items were given at the end of the test and for JH2(b) they
were given at the beginning of the test. We can also observe that, for
some items, there exists a mild tendency for the value of $\hat{k}^*$
to become greater as the school year increases, and, for some others,
this tendency is reversed. Items 39(3), 40(4), 44(8), 45(9), 47(11),
53(17) and 55(19) belong to the first category, and items 37(1),
48(12), 49(13), 50(14), 51(15), 52(16) and 54(18) are members of the
second category. In spite of these mild tendencies, however, the
values of index $\hat{k}^*$ are large, ranging, approximately, from
3.42 to 4.99, for all the examinee groups, a result which indicates a
high desirability of this subset of test items for these groups of
examinees.


TABLE 7-1

Entropy of Each of the Nineteen Vocabulary Items Based on Each of the
Four Subgroups, i.e., Junior High School, Grades 1, 2, 2 and 3. For
the First Two Subgroups of Subjects Test J1 Was Used and for the Other
Two Subgroups Test J2 Was Used.

Item        JH1       JH2(a)    JH2(b)    JH3
37 (1)    1.55907   1.57572   1.51218   1.52080
39 (3)    1.57359   1.57997   1.55547   1.58566
40 (4)    1.41987   1.48141   1.39913   1.46917
41 (5)    1.47880   1.52098   1.46885   1.48496
42 (6)    1.50740   1.51576   1.50880   1.42679
43 (7)    1.54070   1.51224   1.39256   1.49871
44 (8)    1.43049   1.51333   1.41791   1.47934
45 (9)    1.42195   1.49895   1.54177   1.52485
46 (10)   1.37234   1.36152   1.36544   1.39912
47 (11)   1.52673   1.58391   1.54137   1.57599
48 (12)   1.59254   1.57072   1.57317   1.43130
49 (13)   1.51299   1.40124   1.40700   1.32933
50 (14)   1.54630   1.46214   1.50665   1.43095
51 (15)   1.59962   1.59600   1.58320   1.55950
52 (16)   1.54651   1.54903   1.51294   1.51407
53 (17)   1.45244   1.46629   1.41821   1.48312
54 (18)   1.51192   1.45933   1.51052   1.46054
55 (19)   1.23002   1.27989   1.25075   1.30371
56 (20)   1.60838   1.60223   1.58595   1.60504


TABLE 7-2

Number of Hypothetical, Equivalent Alternatives of Each of the
Nineteen Vocabulary Items Based on Each of the Four Subgroups, i.e.,
Junior High School, Grades 1, 2, 2 and 3. For the First Two Subgroups
of Subjects Test J1 Was Used and for the Other Two Subgroups Test J2
Was Used.

Item        JH1       JH2(a)    JH2(b)    JH3
37 (1)    4.75440   4.83420   4.53660   4.57590
39 (3)    4.82391   4.85480   4.73730   4.88252
40 (4)    4.13659   4.39917   4.05166   4.34565
41 (5)    4.38768   4.57672   4.34425   4.41479
42 (6)    4.51496   4.55290   4.52130   4.16531
43 (7)    4.66784   4.53688   4.02513   4.47592
44 (8)    4.18076   4.54183   4.12850   4.39004
45 (9)    4.14519   4.47701   4.67284   4.59447
46 (10)   3.94459   3.90212   3.91744   4.05162
47 (11)   4.60310   4.87397   4.67098   4.83551
48 (12)   4.91623   4.81011   4.82191   4.18412
49 (13)   4.54029   4.06023   4.08370   3.77850
50 (14)   4.69408   4.31519   4.51161   4.18267
51 (15)   4.95113   4.93326   4.87053   4.75646
52 (16)   4.69506   4.70690   4.54008   4.54521
53 (17)   4.27352   4.33314   4.12972   4.40669
54 (18)   4.53542   4.30307   4.52908   4.30829
55 (19)   3.42128   3.59625   3.49295   3.68295
56 (20)   4.99472   4.96410   4.88392   4.97805

We can observe a tendency that, regardless of the groups of
examinees, some items have higher values of $\hat{k}^*$ than others,
and some other items have lower values of $\hat{k}^*$ than others.
Items 56(20), 51(15) and 39(3) exemplify the first category, and items
55(19) and 46(10) are members of the second category.

The  mean  and  the  standard  deviation  of  the  nineteen  values 
of  k*  for  each  of  the  four  examinee  groups  were  computed,  and  are 
presented  in  Table  7-3.  We  can  see  that  all  the  mean  values  are 
between  4.39  and  4.51,  and  all  the  standard  deviations  are  between 
0.34  and  0.40,  i.e.,  very  close  to  one  another,  respectively. 

As  an  additional  information,  the  product-moment  correlation 
coefficient  of  £*'s  ,  which  are  shown  in  Table  7-2,  was  computed 
for  each  pair  of  examinee  groups,  and  the  result  is  presented  in 
Table  7-4.  We  can  see  that  these  values  are  fairly  large  and 
positive,  as  we  can  expect  from  Table  7-2. 

TABLE 7-4

Product-Moment Correlation Coefficient of the Index
k* for Each Pair of the Four Examinee Groups.

             JH1       JH2(a)    JH2(b)    JH3

JH1        1.00000   0.82705   0.82711   0.60447
JH2(a)     0.82705   1.00000   0.85120   0.85770
JH2(b)     0.82711   0.85120   1.00000   0.71444
JH3        0.60447   0.85770   0.71444   1.00000

The result of the principal factor analysis of the correlation
matrix, Table 7-4, with the largest correlation coefficient of each
row or column as the first estimate of the communality and using
three iterative reestimations of the communalities, provides us with
the eigenvalues 3.237, 0.266, 0.044 and -0.011. Since the
correlation matrix, with communalities as the principal diagonal
elements, is positive semi-definite, the negative eigenvalue is
due to error, resulting mainly from the inaccuracy of the
estimation of the communalities. The final communality estimates
are approximately 0.863, 0.999, 0.862 and 0.833, respectively.
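
The procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say how many factors were retained when the communalities were reestimated, so the sketch retains every factor with a positive eigenvalue, and it caps the communalities at 1. The function name principal_factor and its arguments are hypothetical. Because of these assumptions the output need not reproduce the reported eigenvalues exactly.

```python
import numpy as np

# Correlation matrix of the k* values from Table 7-4
# (order: JH1, JH2(a), JH2(b), JH3).
R = np.array([
    [1.00000, 0.82705, 0.82711, 0.60447],
    [0.82705, 1.00000, 0.85120, 0.85770],
    [0.82711, 0.85120, 1.00000, 0.71444],
    [0.60447, 0.85770, 0.71444, 1.00000],
])


def principal_factor(R, n_reestimates=3):
    """Principal factor analysis with iterated communality estimates."""
    R = np.asarray(R, dtype=float)
    off_diag = R - np.diag(np.diag(R))
    # First estimate: the largest correlation in each row (or column).
    h2 = np.abs(off_diag).max(axis=1)
    for step in range(n_reestimates + 1):
        # Put the current communalities on the principal diagonal.
        reduced = off_diag + np.diag(h2)
        eigvals, eigvecs = np.linalg.eigh(reduced)
        order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        if step == n_reestimates:
            break
        # Assumption: keep the factors with positive eigenvalues.
        keep = eigvals > 0
        loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])
        h2 = np.minimum((loadings ** 2).sum(axis=1), 1.0)
    first = eigvecs[:, 0] * np.sqrt(eigvals[0])    # first factor loadings
    if first.sum() < 0:                            # fix the arbitrary sign
        first = -first
    return eigvals, first, h2


eigvals, first_loadings, communalities = principal_factor(R)
print("eigenvalues:          ", np.round(eigvals, 3))
print("first factor loadings:", np.round(first_loadings, 3))
print("communalities:        ", np.round(communalities, 3))
```

The first eigenvalue can be compared with the remaining ones to judge, as in the text, whether a single general factor dominates.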

We can say from this result that a strong, dominating general
factor exists behind the four sets of k*'s, since the first
eigenvalue, 3.237, is by far the largest, and the other eigenvalues
are close to zero. The first factor loadings for the four examinee
groups, which are the correlation coefficients between this general
factor and the separate sets of k*'s, respectively, turned out
to be 0.868, 0.983, 0.905 and 0.836.

These facts indicate that the four examinee groups are fairly
similar to one another with respect to the configuration of the
values of k* as far as these nineteen vocabulary test items are
concerned.

VIII  Proposal  of  a  New  Family  of  Models  for  the  Multiple-Choice
Item

Throughout  the  history  of  mental  measurement,  the  multiple- 
choice  item  has  been  treated  as  a  "poor  image  of  the  free-response 
item,"  and  very  little  accomplishment  has  been  made  in  pursuing 
its theoretical advantage, rather than its handicap. Most
researchers these days mechanically adopt the three-parameter
logistic model for research which is based on the multiple-choice
item, without even trying to validate the model. As long as
they  continue  doing  this,  we  shall  never  be  able  to  expect  any 
progress  in  this  area  of  science,  in  spite  of  the  fact  that  more 
and  more  research  materials  and  published  papers  are  accumulated 
year  by  year. 

One of the author's purposes in pursuing the method of
estimating the operating characteristics without assuming
any mathematical model a priori (Samejima, 1977b, 1977c, 1978a,
1978b, 1978c, 1978d, 1978e, 1978f) has been to approach the operating
characteristics of distractors, which are completely neglected
by the users of three-parameter models. While this approach is
undoubtedly more scientific than any others, it will be desirable
to consider new types of models, which reflect the psychological
reality behind the examinee's behavior in the multiple-choice
situation far better than three-parameter models and the knowledge
or random guessing principle.

The research on the vocabulary measurement made by Shiba
and others should be credited for the fact that they did not
accept the fashionable three-parameter logistic model blindly as
many other researchers do, and, moreover, that they tried to make
full use of the information given by the distractors, to the extent
that they used it for branching examinees in tailored testing. As
long as we treat the multiple-choice item as a binary item, it will
be a poor substitute for the free-response item, one contaminated
by noise or guessing. If we make use of the information given
by the distractors, however, the multiple-choice item can be more
informative than the free-response item, and will no longer be a
poor image of the free-response item.

The family of models that will be proposed in this chapter
is related to the graded response model (Samejima, 1969, 1972),
in which an item is scored into more than two response categories.
Let $x_g$ be the graded item score, which assumes integers 0
through $m_g$, and $P_{x_g}(\theta)$ be its operating characteristic.
The graded response level can be classified into the homogeneous and
the heterogeneous cases (Samejima, 1972), and we can name the
normal ogive model (Samejima, 1972, 1973) and the logistic model
(Samejima, 1972) as models in the homogeneous case, and Bock's
multi-nomial response model (Bock, 1972; Samejima, 1972) as an
example in the heterogeneous case. In these models, the operating
characteristic of the item response category is defined, respectively,
as follows.


(8.1)  $P_{x_g}(\theta) = (2\pi)^{-1/2} \int_{a_g(\theta - b_{x_g+1})}^{a_g(\theta - b_{x_g})} e^{-u^2/2} \, du$

(8.2)  $P_{x_g}(\theta) = [1 + \exp\{-D a_g (\theta - b_{x_g})\}]^{-1} - [1 + \exp\{-D a_g (\theta - b_{x_g+1})\}]^{-1}$

(8.3)  $P_{x_g}(\theta) = \exp\{\alpha_{x_g}\theta + \beta_{x_g}\} \left[ \sum_{s=0}^{m_g} \exp\{\alpha_s\theta + \beta_s\} \right]^{-1}$ .



In both the normal ogive and the logistic models, i.e., in (8.1) and
(8.2), the item parameter $a_g$ is a positive number, and the item
response parameter $b_{x_g}$ satisfies the relationship such that

(8.4)  $b_0 < b_1 < b_2 < \ldots < b_{m_g} < b_{m_g+1}$ .

In the latter, D is a positive number which assumes 1.7 when the
logistic model is used as a substitute for the normal ogive model.
In Bock's multi-nomial model, one of the item response parameters,
$\alpha_{x_g}$, satisfies the inequality,

(8.5)  $\alpha_0 \leq \alpha_1 \leq \alpha_2 \leq \ldots \leq \alpha_{m_g}$ .


Suppose that the multiple-choice item g is constructed in
such a way that all the main, plausible answers are covered by the
alternatives, in addition to the correct answer. Suppose, further,
that no guessing is involved in the examinee's behavior in answering
item g. Then the examinee will either be attracted to one of
the alternatives, or will have no idea at all as to its answer.
Arrange all the distractors in the order of their plausibility,
and give them the numbers 1 through ($m_g - 1$) in ascending order.
The number assigned to the correct answer is $m_g$, or m for
simplicity, and the one assigned to the "no idea at all" category
is 0. In such a situation, the operating characteristic of the
graded response category can be used as the operating characteristic
of the alternative, treating "no answer" as the additional alternative,
to which the item score is 0.

In  practice,  however,  because  of  the  pressure  of  testing, 
it  is  rather  unlikely  that  the  examinee  will  leave  the  item  unanswered 
even  when  he  has  "no  idea  at  all."  For  this  reason,  now  we  shall 
assume  that  the  examinee  guesses  randomly  when  he  is  not  attracted 
by  the  plausibility  of  any  alternative.  Thus  we  shall  deal  with 
the  m  alternatives  as  the  graded  response  categories,  1  through 
m  ,  and  we  can  write  for  the  operating  characteristic  of  the 
alternative 


(8.6)  $P_{x_g}(\theta) = \Psi_{x_g}(\theta) + (1/m_g) \left[ 1 - \sum_{s=1}^{m_g} \Psi_s(\theta) \right]$ ,  $x_g = 1, 2, \ldots, m_g$ ,

where $\Psi_{x_g}(\theta)$ is the operating characteristic of the
alternative which is numbered $x_g$, when no guessing is involved.
Thus we can use one of the $P_{x_g}(\theta)$'s defined by (8.1), (8.2)
and (8.3), or a similar operating characteristic of the graded response
category with a sound rationale behind it, depending upon the
nature of the item and the set of alternatives.

For the purpose of illustration, we shall use the normal ogive
model for $\Psi_{x_g}(\theta)$, with $a_g = 1.5$ and with the
$b_{x_g}$'s equal to -2.0, -1.0, 0.0, 1.0 and 2.0 for
$x_g = 1, 2, 3, 4, 5$, respectively. Figure 8-1
presents the operating characteristics of the ($m_g + 1$) alternatives,
obtained by (8.1), when no guessing is involved and "no answer"
is treated as the additional alternative, or category 0.
In this example, the operating characteristics of the four distractors
are unimodal, with -1.5, -0.5, 0.5 and 1.5 as the modal points,
respectively. Figure 8-2 presents the operating characteristics
of the five alternatives when guessing is involved, which are
given by (8.6) with $\Psi_{x_g}(\theta)$ replaced by $P_{x_g}(\theta)$
given in (8.1). We can see that, unlike the operating characteristics
when no guessing is involved, these curves have the common asymptote,
1/5, when $\theta$ approaches negative infinity. To compare the two
operating characteristics of each alternative more clearly, Figure 8-3
presents the two curves for each alternative in one graph, with the
dotted line for the one without guessing, and the solid line for the
one with guessing.
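
The numerical illustration can be reproduced with a short Python sketch of (8.6), using the normal ogive model (8.1) with the slope 1.5 and the boundaries -2.0 through 2.0 given above. The function names are hypothetical, and the end-point convention (lowest boundary at minus infinity, highest at plus infinity) is an assumption that the text leaves implicit.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


A_G = 1.5                          # item parameter a_g of the illustration
B = [-2.0, -1.0, 0.0, 1.0, 2.0]    # b_1, ..., b_5 of the illustration


def psi(x, theta, a=A_G, b=B):
    """Operating characteristic of category x (0..m) without guessing, (8.1)."""
    m = len(b)
    upper = 1.0 if x == 0 else normal_cdf(a * (theta - b[x - 1]))
    lower = 0.0 if x == m else normal_cdf(a * (theta - b[x]))
    return upper - lower


def p_with_guessing(x, theta, a=A_G, b=B):
    """Operating characteristic of alternative x (1..m) with guessing, (8.6)."""
    m = len(b)
    no_idea = 1.0 - sum(psi(s, theta, a, b) for s in range(1, m + 1))
    return psi(x, theta, a, b) + no_idea / m


# Locate the modal points of the four distractors on a grid of theta values.
grid = [i / 100.0 for i in range(-400, 401)]
modes = [max(grid, key=lambda t, x=x: psi(x, t)) for x in range(1, 5)]
print("modal points of the distractors:", modes)

# Behavior far below the lowest boundary, where the asymptote 1/5 applies.
print("P at theta = -10:",
      [round(p_with_guessing(x, -10.0), 4) for x in range(1, 6)])
```

The grid search is a deliberately simple choice; the modal points could equally be obtained analytically as the midpoints of adjacent boundaries.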

The  family  of  models  presented  by  (8.6)  seems  reasonable, 
in  the  sense  that  it  considers  both  the  information  given  by  the 
distractors  and  the  noise  caused  by  random  guessing.  Its  behavior 
will  be  investigated  further,  and  will  be  discussed  in  a  separate 
paper. 

[FIGURE 8-3. Comparison of the Two Operating Characteristics in the
Normal Ogive Model.]

It is interesting to note that the use of the normal ogive
model and its logistic approximation in the research on vocabulary
measurement conducted by Shiba and others can be justified by
the new family of models. As we can see in the fifth graph of
Figure 8-3, when the parameter $b_1$ is as distant from $b_{m_g}$ as
in this example, the operating characteristic of the correct answer
is practically the same as the item characteristic function of
the normal ogive model on the dichotomous response level, except
for the additional "tail" on the lower levels of ability. If
this is the case with all the items in the test and the ability
distribution of our examinees does not include lower levels of
$\theta$ where these tails lie, we can approximate the operating
characteristic of the correct answer by the normal ogive model
on the dichotomous response level, and use the tetrachoric correlation
coefficient and the logistic approximation and so on, just as Shiba
and others did.
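
The size of the "tail" can be written out explicitly. The short derivation below is a sketch that assumes the normal ogive model (8.1) is substituted into (8.6), with the lowest boundary at minus infinity and the highest at plus infinity; $\Phi$ denotes the standard normal distribution function, a symbol introduced here for convenience.

```latex
\begin{align*}
1 - \sum_{s=1}^{m_g} \Psi_s(\theta)
  &= \Psi_0(\theta) = 1 - \Phi\{a_g(\theta - b_1)\}, \\
\Psi_{m_g}(\theta)
  &= \Phi\{a_g(\theta - b_{m_g})\}, \\
P_{m_g}(\theta)
  &= \Phi\{a_g(\theta - b_{m_g})\}
     + \frac{1}{m_g}\bigl[1 - \Phi\{a_g(\theta - b_1)\}\bigr].
\end{align*}
```

The first term is the dichotomous normal ogive, and the second term is the tail. With $a_g = 1.5$, $b_1 = -2.0$ and $m_g = 5$, as in the illustration, the tail equals 0.1 at $\theta = -2.0$ and has shrunk to about 0.0003 at $\theta = 0$.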


IX  Discussion  and  Conclusions 

We  have  introduced  Sato's  number  of  hypothetical,  equivalent 
alternatives,  and  defined  its  modification,  index  k*  ,  as  a  measure 
of  invalidating  the  three-parameter  logistic,  or  normal  ogive,  model. 
We  have  also  introduced  Shiba's  research  on  the  measurement  of 
vocabulary  and  the  construction  of  a  tailored  test,  using  the 
information  given  by  distractors.  Various  observations  and 
discussion  have  been  made  concerning  the  three-parameter  models 
and  item  distractors,  the  validation  of  mathematical  models,  and 
so forth. Finally, a new family of models for the multiple-choice
item, which formulates both the operating characteristics of
distractors and the effect of random guessing, has been proposed.

There  is  a  tendency  that  researchers  restrict  their  ideas 
within  the  tradition  of  their  own  culture.  Thus  they  tend  to 
accept  whatever  is  familiar  to  them,  what  is  fashionable  among 
other  researchers  in  their  culture,  and  so  on,  without  feeling  the 
necessity  of  validating  the  ideas  and  mathematical  models  in 
relation  with  their  specific  data  and  psychological  reality. 

The  virtue  of  doubt  can  be  obtained  if  they  shift  their  attention 
to  what  is  going  on  outside  of  their  own  culture  and  climate, 
and  try  to  think  what  is  really  right. 

Three-parameter  models  for  the  multiple-choice  item  have 
been  too  readily  accepted  among  psychometricians  and  applied 
psychologists,  and  they  have  been  using  the  models  without  trying 
to validate them. Unless we correct this wrong orientation,
psychology will never make any progress, regardless of the fact
that more data are accumulated and more papers are published year
by year. In the author's opinion, psychology has not yet established
itself as a science, and we need to do that by putting ourselves on
the right track of research. In so doing, the validation of
mathematical models is certainly one of the most important things.

The  departure  from  the  tradition  should  also  be  made  in  the 
treatment  of  the  multiple-choice  item.  Instead  of  trying  to  handle 
the  multiple-choice  item  as  a  "blurred"  substitute  for  the  free- 
response  item,  we  must  make  full  use  of  its  advantage,  which  the 
free-response  item  does  not  have.  The  operating  characteristics 
of  the  distractors  of  the  multiple-choice  item  will  add  more 
information  about  the  examinee’s  ability  level.  We  must  set 

a  criterion  for  the  quality  of  multiple-choice  items  from  this 
aspect  also. 

TABLE A-2

Frequency Ratio, P_j, of Each of the Five Alternatives and the
Estimated Probability, P*_R, for the Correct Answer with Which
the Examinee Selects the Correct Answer by Random Guessing at
the Maximum, for Each of the Nineteen Vocabulary Items.
Junior High School, Grade 1, for Test J1.

                                           Alternative
Item    P_j and P*_R            1        2        3        4        5

37(1)   Relative frequency   0.50175  0.08741  0.10315  0.10315  0.20455
        Modified rel. freq.  0.13271
39(3)   Relative frequency   0.16192  0.20463  0.20996  0.09075  0.33274
        Modified rel. freq.                                      0.17450
40(4)   Relative frequency   0.10471  0.24607  0.15707  0.47644  0.01571
        Modified rel. freq.                             0.16692
41(5)   Relative frequency   0.09250  0.03490  0.04014  0.14834  0.68412
        Modified rel. freq.                                      0.09324
42(6)   Relative frequency   0.43333  0.03684  0.21228  0.14737  0.17018
        Modified rel. freq.  0.16122
43(7)   Relative frequency   0.04545  0.53846  0.17133  0.11713  0.12762
        Modified rel. freq.           0.12503
44(8)   Relative frequency   0.20914  0.11775  0.02460  0.58524  0.06327
        Modified rel. freq.                             0.13040
45(9)   Relative frequency   0.08979  0.04401  0.21655  0.60915  0.04049
        Modified rel. freq.                             0.12427
46(10)  Relative frequency   0.52113  0.08099  0.07746  0.28873  0.03169
        Modified rel. freq.  0.16263
47(11)  Relative frequency   0.12127  0.39367  0.27768  0.09315  0.11424
        Modified rel. freq.           0.16628
48(12)  Relative frequency   0.14362  0.17720  [illegible]  0.11879  [illegible]
        Modified rel. freq.                                          0.13854
49(13)  Relative frequency   0.20070  0.05410  0.07330  0.12216  0.54974
        Modified rel. freq.                                      0.12718
50(14)  Relative frequency   0.53940  0.08056  0.06130  0.15061  0.16813
        Modified rel. freq.  0.12468
51(15)  Relative frequency   0.15893  0.14643  0.13393  0.20179  0.35893
        Modified rel. freq.                                      0.16225
52(16)  Relative frequency   0.20531  0.41593  0.14159  0.06018  0.17699
        Modified rel. freq.           0.15807
53(17)  Relative frequency   0.28497  0.08916  0.05944  0.25000  0.31643
        Modified rel. freq.                             0.22911
54(18)  Relative frequency   0.32442  0.19786  0.17825  0.25312  0.04635
        Modified rel. freq.  0.19109
55(19)  Relative frequency   0.04729  0.12609  0.55517  0.05079  0.22067
        Modified rel. freq.                                      0.32187
56(20)  Relative frequency   0.18182  0.17657  0.20105  0.24650  0.19406
        Modified rel. freq.                             0.18862


TABLE A-2 (Continued): Junior High School, Grade 2,
for Test J1.

                                           Alternative
Item    P_j and P*_R            1        2        3        4        5

37(1)   Relative frequency   0.59121  0.08571  0.08571  0.08132  0.15604
        Modified rel. freq.  0.10662
39(3)   Relative frequency   0.12222  0.21556  0.18222  0.11111  0.36889
        Modified rel. freq.                                      0.16372
40(4)   Relative frequency   0.13319  0.20742  0.09825  0.53057  0.03057
        Modified rel. freq.                             0.13810
41(5)   Relative frequency   0.05857  0.02820  0.04121  0.10195  0.77007
        Modified rel. freq.                                      0.06429
42(6)   Relative frequency   0.56236  0.03063  0.14880  0.14661  0.11160
        Modified rel. freq.  0.12318
43(7)   Relative frequency   0.02183  0.68122  0.11572  0.10044  0.08079
        Modified rel. freq.           0.09013
44(8)   Relative frequency   0.14254  0.11842  0.03289  0.63816  0.06798
        Modified rel. freq.                             0.10216
45(9)   Relative frequency   0.06536  0.05011  0.17211  0.65142  0.06100
        Modified rel. freq.                             0.10025
46(10)  Relative frequency   0.67102  0.05447  0.06318  0.19608  0.01525
        Modified rel. freq.  0.11336
47(11)  Relative frequency   0.09368  0.65795  0.11765  0.06536  0.06536
        Modified rel. freq.           0.03629
48(12)  Relative frequency   0.12308  0.18681  0.08352  0.09890  0.50769
        Modified rel. freq.                                      0.12921
49(13)  Relative frequency   0.20870  0.03478  0.04130  0.07609  0.63913
        Modified rel. freq.                                      0.11792
50(14)  Relative frequency   [illegible]  0.04176  0.04398  0.18462  0.13646
        Modified rel. freq.  0.12331
51(15)  Relative frequency   0.16964  0.10491  0.12277  0.15179  0.45089
        Modified rel. freq.                                      0.13961
52(16)  Relative frequency   0.13363  0.53229  0.15813  0.06677  0.17916
        Modified rel. freq.           0.12617
53(17)  Relative frequency   0.20479  0.07407  0.05664  0.39216  0.27233
        Modified rel. freq.                             0.18236
54(18)  Relative frequency   0.39245  0.16630  0.18182  0.23947  0.01995
        Modified rel. freq.  0.18393
55(19)  Relative frequency   0.04367  0.09825  0.37991  0.03930  0.43886
        Modified rel. freq.                                      0.21613
56(20)  Relative frequency   0.15385  0.21978  0.17582  0.28132  0.16923
        Modified rel. freq.                             0.16130


TABLE A-2 (Continued): Junior High School, Grade 2,
for Test J2.

                                           Alternative
Item    P_j and P*_R            1        2        3        4        5

1(37)   Relative frequency   0.65611  0.04977  0.08597  0.04977  0.15837
        Modified rel. freq.  0.09724
3(39)   Relative frequency   0.12844  0.20642  0.19266  0.07339  0.39908
        Modified rel. freq.                                      0.16079
4(40)   Relative frequency   0.23077  0.13575  0.10407  0.52036  0.00905
        Modified rel. freq.                             0.15717
5(41)   Relative frequency   0.06335  0.00905  0.04525  0.08145  0.80090
        Modified rel. freq.                                      0.05953
6(42)   Relative frequency   0.56818  0.02727  0.14545  0.14545  0.11364
        Modified rel. freq.  0.12263
7(43)   Relative frequency   0.00452  0.69231  0.10860  0.13575  0.05882
        Modified rel. freq.           0.10171
8(44)   Relative frequency   0.16742  0.05430  0.02715  0.70588  0.04525
        Modified rel. freq.                             0.09401
9(45)   Relative frequency   0.04545  0.04091  0.09545  0.78182  0.03636
        Modified rel. freq.                             0.05940
10(46)  Relative frequency   0.63801  0.04977  0.08145  0.21267  0.01810
        Modified rel. freq.  0.12408
11(47)  Relative frequency   0.10000  0.65000  0.11818  0.03182  0.10000
        Modified rel. freq.           0.09534
12(48)  Relative frequency   0.10407  0.10407  0.04525  0.08145  0.66516
        Modified rel. freq.                                      0.08761
13(49)  Relative frequency   0.22624  0.02715  0.05502  0.05662  [illegible]
        Modified rel. freq.                                      0.13206
14(50)  Relative frequency   0.60909  0.05909  0.04091  0.15455  0.13636
        Modified rel. freq.  [illegible]
15(51)  Relative frequency   0.15385  0.08145  0.11765  0.16290  0.48416
        Modified rel. freq.                                      0.13327
16(52)  Relative frequency   0.16590  0.40092  0.24885  0.05069  0.13364
        Modified rel. freq.           0.16923
17(53)  Relative frequency   0.23529  0.09502  0.02262  0.38462  0.26244
        Modified rel. freq.                             0.19663
18(54)  Relative frequency   0.49772  0.15068  0.14155  0.17808  0.03196
        Modified rel. freq.  0.14233
19(55)  Relative frequency   0.02262  0.17195  0.39819  0.03167  0.37557
        Modified rel. freq.                                      0.25046
20(56)  Relative frequency   0.20814  0.17195  0.20814  0.30317  0.10860
        Modified rel. freq.                             0.17941

TABLE A-2 (Continued): Junior High School, Grade 3,
for Test J2.

                                           Alternative
Item    P_j and P*_R            1        2        3        4        5

1(37)   Relative frequency   0.76091  0.05236  0.04363  0.03316  0.10995
        Modified rel. freq.  0.06686
3(39)   Relative frequency   0.09524  0.16402  0.17108  0.11111  0.45855
        Modified rel. freq.                                      0.13946
4(40)   Relative frequency   0.14510  0.13462  0.06643  0.63287  0.02098
        Modified rel. freq.                             0.10973
5(41)   Relative frequency   0.05226  0.01220  0.03310  0.07491  0.82753
        Modified rel. freq.                                      0.05051
6(42)   Relative frequency   0.65317  0.01232  0.16901  0.08627  0.07923
        Modified rel. freq.  0.10957
7(43)   Relative frequency   0.01754  0.77368  0.07895  0.08772  0.04211
        Modified rel. freq.           0.06511
8(44)   Relative frequency   0.09632  0.05429  0.02452  0.80035  0.02452
        Modified rel. freq.                             0.05889
9(45)   Relative frequency   0.05614  0.08246  0.11754  0.71404  0.02982
        Modified rel. freq.                             0.07956
10(46)  Relative frequency   0.82837  0.02627  0.04904  0.08932  0.00701
        Modified rel. freq.  0.05624
11(47)  Relative frequency   0.06294  0.68007  0.12063  0.06119  0.07517
        Modified rel. freq.           0.08341
12(48)  Relative frequency   0.05769  0.15210  0.01748  0.06119  0.71154
        Modified rel. freq.                                      0.09059
13(49)  Relative frequency   0.18674  0.02792  0.05061  0.02443  0.71030
        Modified rel. freq.                                      0.10427
14(50)  Relative frequency   0.67776  0.03853  0.02102  [illegible]  [illegible]
        Modified rel. freq.  0.10125
15(51)  Relative frequency   0.11943  0.04813  0.14082  0.12299  0.56863
        Modified rel. freq.                                      0.11483
16(52)  Relative frequency   0.10193  0.63972  0.13181  0.02460  0.10193
        Modified rel. freq.           0.10162
17(53)  Relative frequency   0.19056  0.07343  0.05245  0.45280  0.23077
        Modified rel. freq.                             0.16063
18(54)  Relative frequency   0.52035  0.11504  0.14159  0.20354  0.01947
        Modified rel. freq.  0.1449[?]
19(55)  Relative frequency   0.02622  0.15210  0.23776  0.01573  0.56818
        Modified rel. freq.                                      0.16095
20(56)  Relative frequency   0.13589  0.12892  0.16202  0.40941  0.16376
        Modified rel. freq.                             0.14846

APPENDIX  III 


RESEARCHERS  OF  THE  UNIVERSITY  OF  TOKYO 


Dr.  Sukeyori  Shiba 

1-8-3  Minami-Ogikubo,  Suginami-ku 

Tokyo  167,  Japan 

Mr.  Yukihiro  Noguchi 
1208-7  Kamoi,  Midori-ku 
Yokohama  226,  Japan 

Mr.  Tomokazu  Haebara 

Lindquist  Center  for  Measurement 

The  University  of  Iowa 

Iowa  City,  Iowa  52242 

U.S.A. 


TECHNOLOGICAL  RESEARCHERS  IN  EDUCATIONAL  MEASUREMENT  PROBLEMS 


1.  Educational Technology Group of IECE

Dr. Takahiro Sato (Representative)      Phone: (044) 855-1111
Application Research Laboratory                ext. 2296
Central Research Laboratories
Nippon Electric Co., Ltd.
4-1-1 Miyazaki, Takatsu-ku
Kawasaki 213, Japan

Dr. Hiroshi Ikeda                       Phone: (03) 726-1111
Educational Technology Center
Tokyo Institute of Technology
2-12-1 Ohokayama, Meguro-ku
Tokyo 152, Japan

Dr. Hideo Fujiwara                      Phone: (06) 877-5111
Department of Electronic Engineering
Faculty of Engineering
Osaka University
Yamadakami, Suita-shi
Osaka 565, Japan

Dr. Keizo Nagaoka                       Phone: (078) 881-1212
Educational Technology Center
Department of Education
Kohbe University
3-11 Tsurukabuto, Nada-ku
Kohbe 657, Japan

Dr. Yoneo Yamamoto                      Phone: (0886) 23-2311
Department of Information Science
Faculty of Engineering
Tokushima University
2-1 Minamijosanjima-cho
Tokushima 770, Japan
(as of Jan. 4, 1979: University of Illinois, CERL)

Mr. Makoto Takeya                       Phone: (044) 855-1111
Application Research Laboratory
Central Research Laboratories
Nippon Electric Co., Ltd.,
4-1-1 Miyazaki, Takatsu-ku
Kawasaki 213, Japan


Mr.  Masahiko  Kurata  Phone:  (044)  855-1111 

Application  Research  Laboratory 
Central  Research  Laboratories 
Nippon  Electric  Co.,  Ltd., 

4-1-1  Miyazaki,  Takatsu-ku 
Kawasaki  213,  Japan 

Mr.  Yasuhiro  Morimoto  Phone:  (044)  855-1111 

Application  Research  Laboratory 
Central  Research  Laboratories 
Nippon  Electric  Co.,  Ltd., 

4-1-1  Miyazaki,  Takatsu-ku 
Kawasaki  213,  Japan 

Mr.  Hiroyasu  Chimura  Phone:  (044)  855-1111 

Application  Research  Laboratory 
Central  Research  Laboratories 
Nippon  Electric  Co.,  Ltd., 

4-1-1  Miyazaki,  Takatsu-ku 
Kawasaki  213,  Japan 


2.  Others

Dr. Moriya Oda                          Phone: (052) 781-5111
Research Institute of Cybernetics
Department of Engineering
Nagoya University
Furocho, Chikusa-ku
Nagoya 464, Japan

Dr. Masahi Ishiketa                     Phone: (0720) 22-2161
Department of Engineering
Osaka Denki-Tsushin University
18-8 Hatsu-cho, Neyagawa-shi
Osaka-fu 572, Japan

Mr. Haruo Nishinosono                   Phone: (075) 641-9281
Educational Technology Center
Kyoto University of Education
1 Fukakusa-Fujinomori-cho, Fushimi-ku
Kyoto 612, Japan

Dr. Haruo Sunouchi                      Phone: (03) 209-3211
Department of Science and Engineering
Waseda University
3-4-1 Ohkubo, Shinjuku-ku
Tokyo 160, Japan


Mr.  Hajime  Yamashita  Phone:  (03)  203-4141 

Department  of  Politics  and  Economics 

Waseda  University 

1-6-1  Nishi-Waseda,  Shinjuku-ku 

Tokyo  160,  Japan 

Mr.  Masahiro  Yokoi  Phone:  (0427)  32-9111 

Department  of  Engineering 

Tamagawa  University 

6-1-1  Tamagawa-Gakuen,  Machida-shi 

Tokyo  194,  Japan 

Mr.  Takeshi  Kikukawa  Phone:  (0463)  58-1211 

Department  of  Communication  Engineering 

Faculty  of  Engineering 

Tokai  University 

1117  Kitakaname,  Hiratsuka-shi 

Kanagawa-ken  259-12,  Japan 

Dr.  Hiroichi  Fujita  Phone:  (044)  63-1141 

Department  of  Engineering 

Keio  Gijuku  University 

832  Hiyoshi-cho,  Kohoku-ku 

Yokohama  223,  Japan 


PUBLICATIONS  BY  MEMBERS  OF  THE  EDUCATIONAL  TECHNOLOGY  GROUP  OF  THE 
INSTITUTE  OF  ELECTRONICS  AND  COMMUNICATION  ENGINEERS  IN  JAPAN 
(IECE) 


I  Papers  in  English 

[1]  Kurata,  M.  and  T.  Sato. 

Test  construction  system  applying  item  statistics. 
Proceedings  of  the  International  Conference  on 
Cybernetics  and  Society,  1978,  368-372. 

[2]  Sato,  T. 

Instructional  data  processing,  approach  to  computer 
managed  instruction.  NEC  Research  &  Development,  1973, 
29,  38-49. 

[3]  Sato,  T.  and  M.  Kurata. 

Basic  S-P  score  table  characteristics.  NEC  Research 
&  Development,  1977,  47,  64-71. 


II  Books  in  Japanese 

[4]  Sato,  T.  S-P  table  analysis:  analysis  and  interpretation 

of  test  scores.  Tokyo:  Meiji-Tosho  Publishing  Co.,  1975. 

[5]  Sato, T. (Ed.). CMI system (computer managed instruction
     system): Uses of computers in education. Denshi Tsushin
     Gakkai (Institute of Electronics and Communication
     Engineers of Japan), 1976.

III  Other Publications in Japanese

[6]  Fujiwara, H. A study on partitioning of S-P tables for learning
     diagnosis. Kodo Keiryogaku (Japan Behaviometrics), 1979,
     6, 1-9.

[7]  Kurata, M. and T. Sato. Simulation and analysis of the item
     score table using an S-P score table model. Kodo Keiryogaku
     (Japan Behaviometrics), 1976, 4, 11-17.

[8]  Nagaoka, K. and H. Fujita. Analysis for speaking time in
     discussion. Kodo Keiryogaku (Japan Behaviometrics), 1978,
     6, 1-8.

[9]  Sato, T. Engineering techniques in the analysis of educational
     data V: application of entropy. NEC Central Laboratories, 1977.

[10] Sato, T. Construction of a test and the S-P table. NEC Central
     Laboratories, 1978.


[11]  Sato,  T.  Hierarchical  display  of  the  network  of  teaching 

elements  using  the  Interpretive  Structural  Modeling 
technique.  Denshi  Tsushin  Gakkai  (IECE),  Educational 
Technology,  1978,  4,  23-28. 

[12]  Sato,  T.  How  to  make  and  use  S-P  tables  as  a  method  of 

analyzing  the  test  result.  Shido  to  Hyoka  (Guidance 
and  Evaluation),  1978,  282,  44-51. 

[13]  Sato,  T.  Item  bank  system:  A  set  of  test  items  and  its 

cooperative  use.  Paper  presented  at  the  Third  Conference 
on  Educational  Technologies,  Naruto,  Tokushima,  1978. 

[14]  Sato,  T. ,  M.  Kurata  and  H.  Ikeda.  Estimation  of  statistical 

characteristics  of  the  educational  tests.  Denshi  Tsushin 
Gakkai  (IECE),  Educational  Technology,  1978,  2,  27-30. 

[15] Sato, T. and H. Chimura. Determination of hierarchical
     structure of instructional units using the Interpretive
     Structural Modeling method. Denshi Tsushin Gakkai (IECE),
     Educational Technology, 1979, 1, 11-16.

[16] Takeya, M. and T. Sato. On the learning progress distribution
     of programmed instruction. Kodo Keiryogaku (Japan
     Behaviometrics), 1975, 3, 12-21.

[17] Takeya, M. A property analysis of an item score matrix
     in CMI systems. Trans. IECE, 1977, 60, 967-974.

[18]  Takeya,  M.  Hierarchical  structure  analysis  among  instructional 

objectives  on  student  performance  scores.  Denshi 
Tsushin  Gakkai  (IECE),  Educational  Technology,  1978,  7, 

23-28. 

[19] Takeya, M. Application of an item structure analysis to an
     S-P score table. Denshi Tsushin Gakkai (IECE),
     Educational Technology, 1978, 12, 35-40.

[20] Takeya, M. On an expanded item relational structure analysis
     and its application. Denshi Tsushin Gakkai (IECE),
     Educational Technology, 1979, 1, 23-26.

[21] Takeya, M. Comparison of the item relational structure analysis
     based on item orderliness with other methods. Proceedings
     of the Conference of Nippon Kodo Keiryo Gakkai (Japanese
     Behaviometric Society), 1979, 102-105.

[22] Takeya, M. Item relational structure analysis based on performance
     scores for educational evaluation. Trans. IECE, 1979, 62,
     451-458.




AO-A087  127  TENNESSEE  UNIV  KNOXVILLE  DEPT  OF  PSYCHOLOGY  F/G  5/9 

RESEARCH  ON  THE  MULTIPLE-CHOICE  TEST  ITEM  IN  JAPAN!  TOWARD  THE  — ETCCU) 
APR  80  F  SAMEJIMA  N0001A-77-C-0360 

UNCLASSIFIED  ONRT-M3  NL 


[23] Takeya, M. Application of an item relational structure graph
     [illegible] structure analysis of tests. Trans. IECE,
     1979, in press.

[24] [illegible] (IECE), Educational Technology, 1979, 1, [illegible].




